Preface


The purpose of this book is to present a comprehensive account of 
sampling theory as it has been developed for use in sample surveys, with 
illustrations to show how the theory is applied in practice and with a supply 
of exercises to be worked by the student. My hope is that the book will be 
useful both as a text for a course on sample surveys in which the major 
emphasis is on theory and for individual reading by the student who does 
not have access to formal instruction. 

The minimum mathematical equipment necessary for an easy under- 
standing of the proofs is a knowledge of differential calculus as far as the 
determination of maxima and minima (using Lagrange multipliers where 
required), plus a familiarity with elementary algebra and especially with 
the handling of relatively complicated algebraic summations. Knowledge 
of the laws of probability for finite sample spaces, including combinatorial 
probabilities, the properties of expected values, and conditional probability, 
is extremely helpful. On the statistical side, the book presupposes an 
introductory course which covers such topics as means and standard 
deviations, the normal, binomial, and multinomial distributions, confidence 
limits, Student's t-test, linear regression, and the simpler types of analysis 
of variance. Occasionally, more advanced results from statistics are used, 
since I have tried to point out the relation between sample survey theory 
and the main stream of statistical theory. In the early parts of the book 
each step in a proof should be readily apparent from the previous steps: 
towards the end, where proofs are more condensed, a little work with paper 
and pencil may be necessary to follow some of the steps in detail. 

The order of presentation of topics in this edition is essentially the same 
as in the first edition, most theorems retaining their old numbers. Chapter 
5 on stratification, which was already undesirably long in the first edition, 
has, however, been split into two chapters. The present Chapter 5 contains 
the older, standard results. A new Chapter 5A is devoted to the numerous 
specialized topics that are necessary for the most efficient use of stratifica- 
tion. A further change in order is that an introduction to ratio estimates is 
now included in Chapters 2 and 3 instead of being postponed to Chapter 6. 
This change was made because ratio estimates are found in practice, often 
in disguised form, even in the simplest types of survey, so that an early 
introduction to them seemed advisable. The teacher who prefers to post- 
pone this subject until Chapter 6 may, of course, continue to do so. 

Many sections have been added at appropriate places to cover results 
published since the writing of the first edition in 1951-1952. Since these 
developments are quite miscellaneous, the new sections are indicated by an 
asterisk in the table of contents. Some of the major topics are as follows. 
Several sections are devoted to the statistical methods that apply when 
survey results are to be presented separately for specified subdivisions of 
the population (for example, persons of different ages or house owners 
and renters) and when comparisons among these subdivisions are wanted 
for analytical reasons. In stratified sampling newer work deals with the 
construction of strata and the choice of number of strata, with the optimum 
sample sizes in individual strata when specified levels of precision are to be 
attained for each of several variables and with two-way stratification when 
the sample is small. A summary is given of the extensive recent researches 
on sampling without replacement when primary units are selected with 
unequal probabilities. The study of nonsampling errors has produced new 
methods of investigating the effectiveness of call-backs, as compared with 
other techniques for reducing the bias due to nonresponse, and new methods 
of gathering data that throw light on the contribution of errors of measure- 
ment to the total errors in estimates made from surveys. 

Some of the new sections are intended to fill gaps in presentation that 
were pointed out by teachers who have used this book or were suggested 
by my own experience. In two-stage sampling, for instance, the variance 
formulas are given separately for each of the principal methods of sample 
selection and estimation, and the efficiency of self-weighting estimates is 
discussed in greater detail. Although these formulas can be deduced from 
one or two general theorems, the convenience of having the explicit results 
[...] appeared worth the extra space required. Many old 
[...] to include new developments or to clarify the 
[...]

The present edition is about one third larger than the first. I view this 
growth with mixed feelings. Even in the first edition it [...] 
cover all the material in a [...]

However, the sections have been prepared so that many can be omitted, or 
condensed to a brief statement of results, without creating difficulties in 
reading later parts of the book. Although the choice of topics for dis- 
cussion will be governed by the views of the teacher and by the level of 
preparation and fields of application of the students, the following sugges- 
tions are made of sections that may be omitted or condensed in an intro- 
ductory course: 2.8, 2.13, 2.14; 3.9, 3.11; 4.6, 4.7; 5.8, 5.9; 5A.1, 5A.3, 
5A.4, 5A.5, 5A.10, 5A.12; 6.4, 6.5, 6.9, 6.13, 6.14, 6.15, 6.17; 7.4, 7.5, 7.7, 
7.8, 7.9; 8.5, 8.6, 8.8, 8.11, 8.12; 9.5, 9.6, 9.11, 9.12, 9.13; 10.7, 10.8, 10.9, 
10.10; 11.7, 11.9, 11.16; 12.5, 12.6, 12.7, 12.8; 13.5, 13.7, 13.15, 13.16. 

Dr. Alva L. Finkner and Dr. Emil H. Jebe prepared a large part of the 
lecture notes from which the first edition was written, and Dr. F. C. Cornell, 
Dr. J. A. Doull, and Dr. Finkner kindly gave their permission to 
quote data from surveys. Some investigations, both theoretical and 
applied, that served as background material were made possible by a 
research contract with the Office of Naval Research, U. S. Navy Depart- 
ment. In the preparation of the present edition the secretarial staff of the 
Department of Statistics, Harvard University, performed nobly in typing, 
and generous assistance in proofreading was given by Mrs. Cleo Youtz and 
by any graduate students who had the ill-fortune to pass my office door 
during the critical time-period. The author index was prepared by Miss 
Susan Rogers and the answers to exercises checked by Mr. P. S. R. 
Sambasiva Rao. I have received much stimulating discussion of recent 
trends in sampling from colleagues Lyle D. Calvin, R. M. Cyert, W. 
Edwards Deming, Tore Dalenius, Morris H. Hansen, Herman O. Hartley, 
William N. Hurwitz, Leslie Kish, William G. Madow, and Frederick F. 
Stephan. To all these persons, named or unnamed, I would like to 
express my thanks. 


Cambridge, Massachusetts WILLIAM G. COCHRAN 
November 1962 


CHAPTER 1 


Introduction 


1.1 ADVANTAGES OF THE SAMPLING METHOD 


Our knowledge, our attitudes, and our actions are based to a very large 
extent on samples. This is equally true in everyday life and in scientific 
research. A person's opinion of an institution that conducts thousands of 
transactions every day is often determined by the one or two encounters 
which he has had with the institution in the course of several years. The 
traveler who spends 10 days in a foreign country and then proceeds to 
write a book telling the inhabitants how to revive their industries, reform 
their political system, balance their budget, and improve the food in their 
hotels is a familiar figure of fun. But in a real sense he differs from the 
political scientist who devotes 20 years to living and studying in the country 
only in that he bases his conclusions on a much smaller sample of ex- 
perience and is less likely to be aware of the extent of his ignorance. In 
science and human affairs alike we lack the resources to study more than a 
fragment of the phenomena that might advance our knowledge. 

Until the last 30 years little attention was given to the problems of how 
to obtain a good sample and how to draw sound conclusions from the 
results. This does not matter so long as the material from which we are 
sampling is uniform, so that any kind of sample gives almost the same 
results. Laboratory diagnoses about the state of our health are made from 
a few drops of blood. This procedure is based on the assumption that the 
circulating blood is always well mixed and that one drop tells the same 
story as another—an assumption which we as laymen fervently hope is 
correct. But when the material is far from uniform, as is often the case, the 
method by which the sample is obtained is critical, and the study of tech- 
niques that ensure a trustworthy sample becomes important. 

This book contains an account of the body of theory that has been built 
up to provide a background for good sampling methods. In most of the 
applications for which this theory was constructed, the aggregate about 
which information is desired is finite and delimited—the inhabitants of a 
town, the machines in a factory, the fish in a lake. In some cases it may 
seem feasible to obtain the information by taking a complete enumeration 
or census of the aggregate. Administrators accustomed to dealing with 
censuses were at first inclined to be suspicious of samples and reluctant to 
use them in place of censuses. Although this attitude no longer persists, 
it may be well to list the principal advantages of sampling as compared 
with complete enumeration. 


Reduced Cost 


If data are secured from only a small fraction of the aggregate, expend- 
itures are smaller than if a complete census is attempted. With large 
populations, results accurate enough to be useful can be obtained from 
samples that represent only a small fraction of the population. In the 
United States the most important recurrent surveys taken by the government 
use samples of around 100,000 persons, or about one person in 
1800. Surveys used to provide facts bearing on sales and advertising 
policy in market research may employ samples of only a few thousand. 


Greater Speed 


For the same reason, the data can be collected and summarized more 
quickly with a sample than with a complete count. This is a vital consid- 
eration when the information is urgently needed. 

Greater Scope 


In certain types of inquiry highly trained personnel or specialized 
equipment, limited in availability, must be used to obtain the data. A complete 
census is impracticable: the choice lies between obtaining the information 
by sampling or not at all. Thus surveys which rely on sampling have more 
scope and flexibility regarding the types of information that can be 
obtained. On the other hand, if accurate information is wanted for many 
subdivisions of the population, the size of sample needed to do the job is 
sometimes so large that a complete enumeration offers the best solution. 


Greater Accuracy 


Because personnel of higher quality can be employed and given intensive 
training and because more careful supervision of the field work and 
processing of results becomes feasible when the volume of work is reduced, a 
sample may actually produce more accurate results than the kind of 
complete enumeration that can be taken. 


1.2 SOME USES OF SAMPLE SURVEYS 


To an observer of developments in sampling over the last 10 years the 
most striking feature is the rapid increase in the number and types of 
surveys taken by sampling. The Statistical Office of the United Nations 
publishes reports from time to time on "Sample Surveys of Current 
Interest" conducted by member countries. The 1960 report lists surveys 
from 52 countries. Many of these surveys seek information of obvious 
importance to national planning on such topics as agricultural production 
and land use, unemployment and the size of the labor force, industrial 
production, wholesale and retail prices, health status of the people, and 
family incomes and expenditures. But more specialized inquiries can also 
be found: for example, housing and social problems of old people 
(Austria), rural debt (Ceylon), the cost of housebuilding (Czechoslovakia), 
the ages of elementary-school pupils (Italy), the effects of television on 
school children (Netherlands), the domestic working conditions of house- 
wives (Sweden), the characteristics and recruitment of foster mothers 
(United Kingdom), the use of technical information by small industry 
(United Kingdom), and the employment of scientists and engineers by 
industry (United States). 

Sampling has come to play a prominent part in national decennial 
censuses. In the United States a 5% sample was introduced into the 1940 
Census by asking extra questions about occupation, parentage, fertility, 
etc., of those persons whose names fell on two of the 40 lines on each page 
of the schedule. The use of sampling was greatly extended in 1950. From a 
20% sample (every fifth line) information was obtained on items such as 
income, years in school, migration, and service in armed forces. By taking 
every sixth person in the 20% sample, a further sample of 3⅓% was created 
to give information on marriage and fertility. A series of questions dealing 
with the condition and age of housing was split into five sets, each set being 
filled in at every fifth house. Sampling was also employed to speed up 
publication of the results. Preliminary tabulations for many important 
items, made on a sample basis, appeared more than a year and a half 
before the final reports. 
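
The sampling fractions quoted for the 1940 and 1950 censuses follow from simple arithmetic. The short sketch below is only an illustration of that arithmetic; the variable names are invented here and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

# 1940 Census: extra questions on 2 of the 40 lines of each schedule page.
frac_1940 = Fraction(2, 40)

# 1950 Census: every fifth line, then every sixth person within that sample.
frac_1950_main = Fraction(1, 5)
frac_1950_sub = frac_1950_main * Fraction(1, 6)

def as_percent(fraction):
    """Return a sampling fraction as a floating-point percentage."""
    return float(fraction * 100)

for label, f in [("1940 sample", frac_1940),
                 ("1950 main sample", frac_1950_main),
                 ("1950 subsample", frac_1950_sub)]:
    print(f"{label}: 1 in {f.denominator // f.numerator} = {as_percent(f):.2f}%")
```

The subsample works out to one person in 30, that is, three and one third per cent of the population.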

This process continued in the 1960 Census. Except for certain basic 
information required from every person for constitutional or legal reasons, 
the whole census was shifted to a 25% sample basis, only one household in 
four receiving the complete schedule. This change, accompanied by 
greatly increased mechanization, resulted in much earlier publication and 
substantial savings. 

On a smaller scale, local governments—city, state, and county—are 
making increased use of sample surveys to obtain information needed for 
future planning and for meeting pressing problems. In the United States 
most large cities have commercial agencies that make a business of planning 
and conducting sample surveys for clients. 

The operation known as market research is heavily dependent on the 
sampling approach. Estimates of the sizes of television and radio audi- 
ences for different programs and of newspaper and magazine readership 
(including the advertisements) are kept continually under scrutiny. Manu- 
facturers and retailers want to know the reactions of people to new prod- 
ucts or new methods of packaging, their complaints about old products, 
and their reasons for preferring one product to another. 

Business and industry have many uses for sampling in attempting to 
increase the efficiency of their internal operations. The important areas of 
quality control and acceptance sampling are outside the scope of this book. 
But, obviously, decisions taken with respect to level or change of quality 
or to acceptance or rejection of batches are well grounded only if results 
obtained from the sample data are valid (within a reasonable tolerance) for 
the whole batch. The sampling of records of business transactions (ac- 
counts, payrolls, stock, personnel)—usually much easier than the sampling 
of people—can provide serviceable information quickly and economically. 
Savings can also be made through sampling in the estimation of inven- 
tories, in studies of the condition and length of the life of equipment, in the 
inspection of the accuracy and rate of output of clerical work, in investi- 
gating how key personnel distribute their working time among different 
tasks, and, more generally, in the new field known as operations research. 
The books by Deming (1960) and Slonim (1960) contain many interesting 
examples showing the range of applications of the sampling method in 
business. 


Opinion, attitude, and election polls, which did much to bring the 
technique of sampling before the public eye, continue to be a popular feature 
of newspapers. In the field of accounting and auditing, which has employed 
sampling for many years, a new interest has arisen in adapting 
modern developments to the particular problems of this field. The status 
of sample surveys as evidence in lawsuits has also been subject to lively 
discussion. 

Sample surveys can be classified broadly into two types—descriptive and 
analytical. In a descriptive survey the objective is simply to obtain certain 
information about large groups: for example, the numbers of men, women, 
and children who view a television program. In an analytical survey, 
comparisons are made between different subgroups of the population, in order 
to discover whether differences exist among them that may enable us to 
form or to verify hypotheses about the forces at work in the population. 
The Indianapolis fertility survey, for instance, was an attempt to determine 
the extent to which married couples plan the number and spacing of 
children, the husband’s and wife’s attitudes toward this planning, the 
reasons for these attitudes, and the degree of success attained (Kiser and 
Whelpton, 1953). 

The distinction between descriptive and analytical surveys is not, of 
course, clear-cut. Many surveys provide data that serve both purposes. 
Along with the rise in the number of descriptive surveys, there has, how- 
ever, been a noticeable increase in surveys taken primarily for analytical 
purposes, particularly in the study of human behavior and health. Surveys 
of the teeth of school children before and after fluoridation of water, of 
the death rates and causes of death of people who smoke different amounts, 
and the huge study of the effectiveness of the Salk polio vaccine may be 
cited. 

The success of the sample survey has led to its employment in estimating 
some unusual items: for example, the lengths of cigarette butts, the num- 
ber of flies in a town, the number of signatures to a petition that were not 
actually written by the persons whose names appear, and the number of 
people who can fold their tongues. These items were relevant, respectively, 
to studies of the relation of lung cancer and smoking, of the effectiveness of 
fly spraying, of the legality of a petition, and of the inheritance of tongue 
folding—although the last item has not, to my knowledge, been the sub- 
ject of an extensive survey. 


1.3 THE PRINCIPAL STEPS IN A SAMPLE SURVEY 


As a preliminary to a discussion of the role that theory plays in a sample 
survey, it is useful to describe briefly the steps involved in the planning and 
execution of a survey. Surveys vary greatly in their complexity. To take a 
sample from 5000 cards, neatly arranged and numbered in a file, is an easy 
task. It is another matter to sample the inhabitants of a region where 
transport is by water through the forests, where there are no maps, where 
15 different dialects are spoken, and where the inhabitants are very sus- 
picious of an inquisitive stranger. Problems that are baffling in one survey 
may be trivial or nonexistent in another. 

The principal steps in a survey are grouped somewhat arbitrarily under 
11 headings. 


Objectives of the Survey 

A lucid statement of the objectives is most helpful. Without this, it is 
easy in a complex survey to forget the objectives when engrossed in the 
details of planning, and to make decisions that are at variance with the 
objectives. 

Population to be Sampled 

The word population is used to denote the aggregate from which the 
sample is chosen. The definition of the population may present no problem, 
as when sampling a batch of electric light bulbs in order to estimate 
[...] In sampling a population of farms, on 
the other hand, rules must be set up to define a farm, and borderline cases 
[...] 

Data to be Collected 


It is well to verify that all the data are relevant to the purposes of the 
survey and that no essential data are omitted. There is frequently a tendency, 
particularly with human populations, to ask too many questions, 
some of which are never subsequently analyzed. An overlong questionnaire 
lowers the quality of the answers to important as well as unimportant 
questions. 

Degree of Precision Desired 

The results of sample surveys [...] 
because only part of the population [...] 
usually costs [...] 
of precision wanted in the results is an important step. This step is the 
responsibility of [...] 
since many administrators are unaccustomed to thinking in 
terms of the amount [...] be tolerated in estimates, consistent 
with making good decisions. The statistician can often help at this stage. 

[...] a choice of measuring instrument and of method of 
[...] population. Data about a person’s state of health may be 
obtained from statements that he makes or from a medical examination. 
The survey may employ a self-administered questionnaire, an interviewer 
who reads a standard set of questions with no discretion, or an interviewing 
process that allows much latitude in the form and ordering of the questions. 
The approach may be by mail, by telephone, by personal visit, or by a 
combination of the three. Much study has been made of interviewing 
methods and problems [see, e.g., Hyman (1954) and Payne (1951)]. 

A major part of the preliminary work is the construction of record forms 
on which the questions and answers are to be entered. With simple 
questionnaires, the answers can sometimes be precoded—that is, entered 
in a manner in which they can be routinely transferred to mechanical 
equipment. In fact, for the construction of good record forms, it is necessary 
to visualize the structure of the final summary tables that will be used 
for drawing conclusions. 


The Frame 

Before selecting the sample, the population must be divided into parts 
which are called sampling units, or units. These units must cover the whole 
of the population and they must not overlap, in the sense that every element 
in the population belongs to one and only one unit. Sometimes the 
appropriate unit is obvious, as in a population of light bulbs, in which the 
unit is the single bulb. Sometimes there is a choice of unit. In sampling 
the people in a town, the unit might be an individual person, the members 
of a family, or all persons living in the same city block. In sampling an 
agricultural crop, the unit might be a field, a farm, or an area of land whose 
shape and dimensions are at our disposal. 

The construction of this list of sampling units, called a frame, is often one 
of the major practical problems. From bitter experience, samplers have 
acquired a critical attitude toward lists that have been routinely collected 
for some purpose. Despite assurances to the contrary, such lists are often 
found to be incomplete, or partly illegible, or to contain an unknown 
amount of duplication. A good frame may be hard to come by when the 
population is specialized, as in populations of bookmakers or of people 
who keep turkeys. Jessen (1955) presents an interesting method of con- 
structing a frame from the branches of a fruit tree. 


Selection of the Sample 

There is now a variety of plans by which the sample may be selected. 
For each plan that is considered, rough estimates of the size of sample can 
be made from a knowledge of the degree of precision desired. The relative 
costs and time involved for each plan are also compared before making a 
decision. 


The Pretest 


It has been found useful to try out the questionnaire and the field meth- 
ods on a small scale. This nearly always results in improvements in the 
questionnaire and may reveal other troubles that will be serious on a large 
scale, for example, that the cost will be much greater than expected. 


Organization of the Field Work 


In extensive surveys many problems of business administration are met. 
The personnel must receive training in the purpose of the survey and in the 
methods of measurement to be employed and must be adequately super- 
vised in their work. A procedure for early checking of the quality of the 
returns is invaluable. Plans must be made for handling nonresponse, 
that is, the failure of the enumerator to obtain information from certain of 
the units in the sample. 


Summary and Analysis of the Data 


The first step is to edit the completed questionnaires, in the hope of 
amending recording errors, or at least of deleting data that are obviously 
erroneous. Decisions about tabulating procedure are needed in cases in 
which answers to certain questions were omitted by some respondents or 
were deleted in the editing process. Thereafter, the tabulations which lead 
to the estimates are performed. Different methods of estimation may be 
available for the same data. 


In the presentation of results it is good practice to report the amount of 
error to be expected in the most important estimates. One of the advantages 
of probability sampling is that such statements can be made, although 
they have to be severely qualified if the amount of nonresponse is 
substantial. 

[...] give accurate estimates. Any completed 
sample is potentially a guide to improved future sampling, in the data that 
[...] 


1.4 THE ROLE OF SAMPLING THEORY 


This list of the steps in a sample survey has been given in order to 
emphasize that sampling is a practical business, which calls for several 
different types of skill. In some of the steps—the definition of the popu- 
lation, the determination of the data to be collected and of the methods 
of measurement, and the organization of the field work—sampling theory 
plays at most a minor role. Although these topics are not discussed further 
in this book, their importance should be realized. Sampling demands 
attention to all phases of the activity: poor work in one phase may ruin a 
survey in which everything else is done well. 

The purpose of sampling theory is to make sampling more efficient. It 
attempts to develop methods of sample selection and of estimation that 
provide, at the lowest possible cost, estimates that are precise enough for 
our purpose. This principle of specified precision at minimum cost recurs 
repeatedly in the presentation of theory. 

In order to apply this principle, we must be able to predict, for any 
sampling procedure that is under consideration, the precision and the 
cost to be expected. So far as precision is concerned, we cannot foretell 
exactly how large an error will be present in an estimate in any specific 
situation, for this would require a knowledge of the true value for the 
population. Instead, the precision of a sampling procedure is judged by 
examining the frequency distribution generated for the estimate if the pro- 
cedure is applied again and again to the same population. This is, of 
course, the standard technique by which precision is judged in statistical 
theory. 
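
The idea of judging precision by applying a procedure again and again to the same population can be imitated numerically. The following is a minimal sketch, not a method taken from the text: it assumes a hypothetical population of 1000 simulated values, simple random sampling without replacement, and the sample mean as the estimate. The function name and all numbers are illustrative.

```python
import random
import statistics

def sampling_distribution(population, n, repeats, seed=0):
    """Draw `repeats` simple random samples of size n (without replacement)
    from the same population and return the resulting sample means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]

# Hypothetical population; in a real survey these values would be unknown.
pop_rng = random.Random(1)
population = [pop_rng.gauss(50, 10) for _ in range(1000)]
true_mean = statistics.fmean(population)

means = sampling_distribution(population, n=25, repeats=2000)

print("population mean:         ", round(true_mean, 2))
print("mean of the estimates:   ", round(statistics.fmean(means), 2))
print("std. dev. of the estimates:", round(statistics.stdev(means), 2))
```

The spread of the 2000 estimates around the population mean is what the theory sets out to predict without having to carry out the repeated draws.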


A further simplification is introduced. With samples of the sizes that 
are common in practice, there is often good reason to suppose that the 
sample estimates are approximately normally distributed. With a normally 
distributed estimate, the whole shape of the frequency distribution is 
known if we know the mean and the standard deviation (or the variance). 
A considerable part of sample survey theory is concerned with finding 
formulas for these means and variances. 

One difference between sample survey theory and the classical theory 
of sampling is that the populations in survey work contain a finite number 
of units. The methods used to prove theorems are different, and the results 
are slightly more complicated, when sampling is from a finite instead of an 
infinite population. For practical purposes these differences in results for 
finite and infinite populations are seldom important. Whenever the size of 
the sample is small (in terms of the number of primary sampling units) 
relative to the size of the population, results derived from an infinite 
population are fully adequate. In general, results for finite populations are 
presented in this book. In some of the more difficult problems, the theory 
for infinite populations is used to simplify the presentation. 


1.5 PROBABILITY SAMPLING 


All sampling procedures for which a theory has been developed have 
the following mathematical properties in common. 


1. We are able to define the set of distinct samples, $S_1, S_2, \ldots, S_\nu$, 
which the procedure is capable of selecting if applied to a specific 
population. This means that we can say precisely what sampling units belong 
to $S_1$, to $S_2$, and so on. For example, suppose that the population 
contains six units, numbered 1 to 6. A common procedure for choosing a 
sample of size 2 gives three possible candidates—$S_1 \sim (1, 4)$; $S_2 \sim (2, 5)$; 
$S_3 \sim (3, 6)$. Note that not all possible samples of size 2 need be included. 

2. Each possible sample $S_i$ has assigned to it a known probability of 
selection $\pi_i$. 

3. We select one of the $S_i$ by a process in which each $S_i$ receives its 
appropriate probability $\pi_i$ of being selected. In the example we might 
assign equal probabilities to the three samples. Then the draw itself can be 
made by choosing a random number between 1 and 3. If this number is 
$j$, $S_j$ is the sample that is taken. 

4. The method for computing the estimate from the sample must be 
stated and must lead to a unique estimate for any specific sample. We may 
declare, for example, that the estimate is to be the average of the 
measurements on the individual units in the sample. 
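
The four properties can be traced through the six-unit example in a few lines of code. This is a sketch under stated assumptions: the measurements in `y` are invented for illustration, the three samples receive equal probabilities as the text suggests, and the estimate is the sample average.

```python
import random
from fractions import Fraction

# Hypothetical measurements on the six units, numbered 1 to 6.
y = {1: 3, 2: 7, 3: 4, 4: 9, 5: 6, 6: 10}

# Property 1: the set of distinct samples the procedure can select.
samples = [(1, 4), (2, 5), (3, 6)]
# Property 2: a known probability of selection for each sample.
probs = [Fraction(1, 3)] * len(samples)

def estimate(sample):
    """Property 4: the estimate is the average of the sampled measurements."""
    return Fraction(sum(y[u] for u in sample), len(sample))

# Frequency distribution of the estimate over all possible samples.
distribution = {s: (estimate(s), p) for s, p in zip(samples, probs)}
expected = sum(p * est for est, p in distribution.values())
population_mean = Fraction(sum(y.values()), len(y))

# Property 3: draw one sample with its assigned probability.
rng = random.Random(0)
chosen = rng.choices(samples, weights=[float(p) for p in probs], k=1)[0]

for s, (est, p) in distribution.items():
    print(s, "estimate =", est, "probability =", p)
print("expected value:", expected, "population mean:", population_mean)
print("sample drawn:", chosen)
```

Because every unit falls in exactly one of the three equally likely samples, the expected value of the estimate coincides with the population mean whatever values are placed in `y`.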


For any sampling procedure that satisfies these properties, we are in a 
position to calculate the frequency distribution of the estimates it generates 
if repeatedly applied to the same population. For we know how frequently 
any particular sample $S_i$ will be selected, and we know how to calculate the 
[...] in $S_i$. It is clear, therefore, that a sampling [...] 
development may be [...] 
The term probability sampling refers to a procedure of this type. This is, 
of course, not the only way in which a sample can be drawn. The following 
are some common types of nonprobability sampling. 


1. [...] accessible. A sample [...] 
top 6 to 9 inches. 

2. The sample is selected haphazardly. In picking ten rabbits from a 
large cage in a laboratory, the investigator [...] 
rest on, without [...] 

3. With a small but heterogeneous population, the sampler inspects the 
whole of it and selects a small sample of “typical” units—that is, units that 
are close to his impression of the average of the population. This method 
is sometimes called judgment or purposive selection. 

4. The sample consists essentially of volunteers, in studies in which 
the measuring process is unpleasant or troublesome to the person being 
measured. 


Under the right conditions, any of these methods can give useful results. 
They are not, however, amenable to the development of a sampling theory, 
since no element of random selection is involved. About the only way of 
examining how good one of them may be is to find a situation in which the 
results are known, either for the whole population or for a probability 
sample, and make comparisons. Even if a method appears to do well in one 
such comparison, this does not guarantee that it will do well under dif- 
ferent circumstances. 

In practice we seldom draw a probability sample by writing down the $S_i$
and $\pi_i$ as outlined above. This is intolerably laborious with a large
population, where a sampling procedure may produce billions of possible
samples. The draw is most commonly made by specifying probabilities of
inclusion for the individual units and drawing units, one by one or in groups,
until the sample of desired size and type is constructed. For the purposes
of a theory it is sufficient to know that we could write down the $S_i$ and
$\pi_i$ if we wanted to and had unlimited time.
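The conditions for probability sampling can be illustrated with a minimal sketch. The six unit values, the three samples, and the names `population`, `samples`, and `probs` are hypothetical and chosen only for illustration; the estimate is the average of the sampled measurements, as in the example of condition 4.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical population of six units and their measurements (illustrative only).
population = {"u1": 1, "u2": 2, "u3": 4, "u4": 5, "u5": 7, "u6": 11}

# The set of distinct samples S_i that the procedure is able to select.
samples = [("u1", "u4"), ("u2", "u5"), ("u3", "u6")]
# A known probability pi_i of selection for each sample (equal here).
probs = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]

def estimate(sample):
    """Average of the measurements on the units in the sample."""
    return Fraction(sum(population[u] for u in sample), len(sample))

# Frequency distribution of the estimate under repeated application.
distribution = defaultdict(Fraction)
for s, p in zip(samples, probs):
    distribution[estimate(s)] += p

expected_estimate = sum(value * p for value, p in distribution.items())
true_mean = Fraction(sum(population.values()), len(population))

for value, p in sorted(distribution.items()):
    print(f"estimate {float(value):5.2f} occurs with probability {p}")
print("mean of the estimates:", expected_estimate, " population mean:", true_mean)
```

In this particular layout every unit belongs to exactly one sample, so the average of the estimates coincides with the population mean; other choices of the $S_i$ and $\pi_i$ need not have this property.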


1.6 USE OF THE NORMAL DISTRIBUTION 


As mentioned previously, the samples in surveys are often large enough
so that an estimate made from them is approximately normally distributed.
Further, with probability sampling, we have formulas that give the mean
and variance of the estimate. Consider first unbiased estimates. An
estimate $\hat\mu$ given by a sampling plan is called an unbiased estimate of some
population characteristic $\mu$ if the mean value of $\hat\mu$, taken over all possible
samples, is equal to $\mu$. In the notation of section 1.5, this condition may
be written

$$E(\hat\mu) = \sum_i \pi_i \hat\mu_i = \mu$$

where $\hat\mu_i$ is the estimate given by the $i$th sample. The symbol $E$, which
stands for "the expected value of," is used frequently.

Suppose that we have taken a sample by a procedure known to give
unbiased estimates and have computed the sample estimate $\hat\mu$ and its
standard deviation $\sigma_{\hat\mu}$ (often called, alternatively, its standard error). How good
is the estimate? We cannot know the exact value of the error of estimate
$(\hat\mu - \mu)$, but from the properties of the normal curve the chances are

0.32 (about 1 in 3) that the absolute error $|\hat\mu - \mu|$ exceeds $\sigma_{\hat\mu}$
0.05 (1 in 20) that the absolute error $|\hat\mu - \mu|$ exceeds $1.96\sigma_{\hat\mu} \approx 2\sigma_{\hat\mu}$
0.01 (1 in 100) that the absolute error $|\hat\mu - \mu|$ exceeds $2.58\sigma_{\hat\mu}$


For example, if a probability sample of the records of batteries in
routine use in a large factory shows an average life $\hat\mu = 394$ days, with a
standard error $\sigma_{\hat\mu} = 4.6$ days, the chances are 99 in 100 that the average
life in the population of batteries lies between

$$\hat\mu_L = 394 - (2.58)(4.6) = 382 \text{ days}$$

and

$$\hat\mu_U = 394 + (2.58)(4.6) = 406 \text{ days}$$
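The arithmetic of the battery example can be written as a short sketch. The function name `normal_confidence_limits` is an illustrative choice, and the calculation assumes, as the text does, an unbiased and approximately normal estimate whose standard error is treated as known.

```python
def normal_confidence_limits(estimate, std_error, deviate):
    """Return (lower, upper) limits: estimate -/+ deviate * standard error."""
    half_width = deviate * std_error
    return estimate - half_width, estimate + half_width

# Battery example: average life 394 days, standard error 4.6 days,
# normal deviate 2.58 for a 99% confidence probability.
lower, upper = normal_confidence_limits(394.0, 4.6, 2.58)
print(f"99% limits: {lower:.1f} to {upper:.1f} days")
```

Rounded to whole days, the two values agree with the limits of 382 and 406 days quoted above.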

The limits, 382 days and 406 days, are called lower and upper
confidence limits. With a single estimate from a single survey, the statement
"$\mu$ lies between 382 and 406 days" is not certain to be correct. The "99%
confidence" figure implies that if the same sampling plan were used many
times in a population, a confidence statement being made from each
sample, about 99% of these statements would be correct and 1% wrong. When
sampling is being introduced into an operation in which complete
censuses have previously been used, a demonstration of this property is
sometimes made by drawing repeated samples of the type proposed from a
population for which complete records exist, so that $\mu$ is known (see, e.g.,
Trueblood and Cyert, 1957). The practical verification that approximately
the stated proportion of statements is correct does much to educate and
reassure administrators about the nature of sampling. Similarly, when a
single sample is taken from each of a series of different populations, about
95% of the 95% confidence statements are correct.

The preceding discussion assumes that $\sigma_{\hat\mu}$, as computed from the sample,
is known exactly. Actually, $\sigma_{\hat\mu}$, like $\hat\mu$, is subject to a sampling error. With
a normally distributed variable, tables of Student's $t$ distribution are used
instead of the normal tables to calculate confidence limits for $\mu$ when the
sample is small. Replacement of the normal table by the $t$ table makes
almost no difference if the number of degrees of freedom in $\sigma_{\hat\mu}$ exceeds 60.
With certain types of stratified sampling and with the method of replicated
sampling (section 13.14) the degrees of freedom are small and the $t$ table is
needed.

1.7 BIAS AND ITS EFFECTS


In sample survey theory it is necessary to consider biased estimates for
two reasons.

1. In some of the most common problems, particularly in the estimation
of ratios, estimates that are otherwise convenient and suitable are found
to be biased.


Fig. 1.1 Effect of bias on errors of estimation. 


2. Even with estimates that are unbiased in probability sampling, errors 
of measurement and nonresponse may produce biases in the numbers that 
we are able to compute from the data. This happens, for instance, if the 
persons who refuse to be interviewed are almost all opposed to some ex- 
penditure of public funds, whereas those who are interviewed are split 
evenly for and against. 


To examine the effect of bias, suppose that the estimate $\hat\mu$ is normally
distributed about a mean $m$ which is a distance $B$ from the true population
value $\mu$, as shown in Fig. 1.1. The amount of bias is $B = m - \mu$. Suppose
that we do not know that any bias is present. We compute the standard
deviation $\sigma$ of the frequency distribution of the estimate; this will, of
course, be the standard deviation about the mean $m$ of the distribution, not
about the true mean $\mu$. We are using $\sigma$ in place of $\sigma_{\hat\mu}$. As a statement about
the accuracy of the estimate, we declare that the probability is 0.05 that the
estimate $\hat\mu$ is in error by more than $1.96\sigma$.

We will consider how the presence of bias distorts this probability. To
do this, we calculate the true probability that the estimate is in error by
more than $1.96\sigma$, where error is measured from the true mean $\mu$. The two
tails of the distribution must be examined separately. For the upper tail,
the probability of an error of more than $+1.96\sigma$ is the shaded area above
$Q$ in Fig. 1.1. This area is given by

$$\frac{1}{\sigma\sqrt{2\pi}} \int_{\mu + 1.96\sigma}^{\infty} e^{-(\hat\mu - m)^2 / 2\sigma^2} \, d\hat\mu$$

Put $\hat\mu - m = \sigma t$. The lower limit of the range of integration for $t$ is

$$\frac{\mu - m}{\sigma} + 1.96 = 1.96 - \frac{B}{\sigma}$$

Thus the area is

$$\frac{1}{\sqrt{2\pi}} \int_{1.96 - (B/\sigma)}^{\infty} e^{-t^2/2} \, dt$$

Similarly, the lower tail, that is, the shaded area below $P$, has an area

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-1.96 - (B/\sigma)} e^{-t^2/2} \, dt$$

From the form of the integrals it is clear that the amount of disturbance
depends solely on the ratio of the bias to the standard deviation. The
results are shown in Table 1.1.


TABLE 1.1

EFFECT OF A BIAS $B$ ON THE PROBABILITY OF AN ERROR
GREATER THAN $1.96\sigma$

                     Probability of Error
$B/\sigma$     $< -1.96\sigma$    $> 1.96\sigma$    Total

0.02         0.0238         0.0262        0.0500
0.04         0.0228         0.0274        0.0502
0.06         0.0217         0.0287        0.0504
0.08         0.0207         0.0301        0.0508
0.10         0.0197         0.0314        0.0511
0.20         0.0154         0.0392        0.0546
0.40         0.0091         0.0594        0.0685
0.60         0.0052         0.0869        0.0921
0.80         0.0029         0.1230        0.1259
1.00         0.0015         0.1685        0.1700
1.50         0.0003         0.3228        0.3231
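The two tail areas behind Table 1.1 can be evaluated numerically. The sketch below assumes only the standard normal distribution function, computed from `math.erfc`; the helper names `normal_cdf` and `tail_probabilities` are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def tail_probabilities(bias_ratio, k=1.96):
    """Lower tail, upper tail, and total probability of an error beyond k*sigma
    when the bias is bias_ratio standard deviations (B/sigma)."""
    lower = normal_cdf(-k - bias_ratio)        # area below -k - B/sigma
    upper = 1.0 - normal_cdf(k - bias_ratio)   # area above  k - B/sigma
    return lower, upper, lower + upper

for b in (0.02, 0.10, 0.20, 0.60, 1.00, 1.50):
    lower, upper, total = tail_probabilities(b)
    print(f"{b:4.2f}  {lower:.4f}  {upper:.4f}  {total:.4f}")
```

The printed values should agree with the corresponding rows of Table 1.1 to within rounding in the last decimal place.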


For the total probability of an error of more than $1.96\sigma$, the bias has
little effect provided that it is less than one-tenth of the standard deviation.
At this point the total probability is 0.0511 instead of the 0.05 which we
think it is. As the bias increases further, the disturbance becomes more
serious. At $B = \sigma$ the total probability of error is 0.17, more than three
times the presumed value.

[...] probability of the corresponding overestimate mounts steadily. In most
applications the total error is the primary interest, but occasionally we are
particularly interested in errors in one direction.

As a working rule, the effect of bias on the accuracy of an estimate is
negligible if the bias is less than one tenth of the standard deviation of the
estimate. If we have a biased method of estimation for which $B/\sigma < 0.1$,
where $B$ is the absolute value of the bias, it can be claimed that the bias is
not an appreciable disadvantage of the method. Even with $B/\sigma = 0.2$, the
disturbance in the probability of error is modest.

In using these results, a distinction must be made between the two sources
of bias mentioned at the beginning of this section. With biases of the type
that arise in estimating ratios, an upper limit to the ratio $B/\sigma$ can be found
mathematically. If the sample is large enough, we can be confident that
$B/\sigma$ will not exceed 0.1. With biases caused by errors of measurement or
nonresponse, on the other hand, it is usually impossible to find a
guaranteed upper limit to $B/\sigma$ that is small. This troublesome problem is
discussed in Chapter 13.


1.8 THE MEAN SQUARE ERROR


In order to compare a biased estimate with an unbiased estimate, or two 
estimates that have different amounts of bias, a useful criterion is the mean 
square error (MSE) of the estimate, measured from the population value 


that is being estimated. Formally, 


$$\text{MSE}(\hat\mu) = E(\hat\mu - \mu)^2 = E[(\hat\mu - m) + (m - \mu)]^2$$
$$= E(\hat\mu - m)^2 + 2(m - \mu)E(\hat\mu - m) + (m - \mu)^2$$
$$= (\text{variance of } \hat\mu) + (\text{bias})^2$$

the cross-product term vanishing since $E(\hat\mu - m) = 0$.

Use of the MSE as a criterion of the accuracy of an estimate amounts to
regarding two estimates that have the same MSE as equivalent. This is not
strictly correct because the frequency distributions of errors $(\hat\mu - \mu)$ of
different sizes will not be the same for the two estimates if they have
different amounts of bias. It has been shown, however, by Hansen,
Hurwitz, and Madow (1953) that if $B/\sigma$ is less than about one half the two
frequency distributions are almost identical in regard to absolute errors
$|\hat\mu - \mu|$ of different sizes. Table 1.2 illustrates this result.

Even at $B/\sigma = 0.6$, the changes in the probabilities as compared with
those for $B/\sigma = 0$ are slight.

TABLE 1.2

PROBABILITY OF AN ABSOLUTE ERROR $\geq 1\sqrt{\text{MSE}}$,
$1.96\sqrt{\text{MSE}}$ AND $2.576\sqrt{\text{MSE}}$

                          Probability
$B/\sigma$     $1\sqrt{\text{MSE}}$    $1.96\sqrt{\text{MSE}}$    $2.576\sqrt{\text{MSE}}$

0            0.317         0.0500           0.0100
0.2          0.317         0.0499           0.0100
0.4          0.319         0.0495           0.0095
0.6          0.324         0.0479           0.0083

Because of the difficulty of ensuring that no unsuspected bias enters into
estimates, we shall usually speak of the precision of an estimate rather than
its accuracy. Accuracy refers to the size of deviations from the true mean
$\mu$, whereas precision refers to the size of deviations from the mean $m$
obtained by repeated application of the sampling procedure.
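The entries of Table 1.2 can be checked with a short sketch. It assumes a normally distributed estimate with $\sigma = 1$, so that $\sqrt{\text{MSE}} = \sqrt{1 + (B/\sigma)^2}$ by the decomposition of the MSE into variance plus squared bias; the function names are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def prob_abs_error_exceeds(k, bias_ratio):
    """P(|estimate - true value| > k * sqrt(MSE)) for a normal estimate
    with unit standard deviation and bias equal to bias_ratio."""
    root_mse = math.sqrt(1.0 + bias_ratio ** 2)   # MSE = variance + bias^2
    upper = 1.0 - normal_cdf(k * root_mse - bias_ratio)
    lower = normal_cdf(-k * root_mse - bias_ratio)
    return upper + lower

for b in (0.0, 0.2, 0.4, 0.6):
    row = [prob_abs_error_exceeds(k, b) for k in (1.0, 1.96, 2.576)]
    print(f"{b:3.1f}  {row[0]:.3f}  {row[1]:.4f}  {row[2]:.4f}")
```

The output should match Table 1.2 to within rounding in the last printed digit, which is one way of seeing how little the probabilities move while $B/\sigma$ stays below about 0.6.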


EXERCISES 


1.1 Suppose that you were using sampling to estimate the total number of
words in a book that contains illustrations.

(a) Is there any problem of definition of the population? (b) What are the pros
and cons of (1) the page, (2) the line, as a sampling unit?

1.2 A sample is to be taken from a list of names that are on cards (one name
to a card) numbered consecutively in a file. Each name is to have an equal
chance of being drawn in the sample. What problems arise in the following
common situations? (a) Some of the names do not belong to the target
population, although this fact cannot be verified for any name until it has been drawn.
(b) Some names appear on more than one card. All cards with the same name
[...] and therefore appear together in the file. (c) [...] more than [...]
scattered anywhere about the file [...]

[...] the last year.

1.4 A city directory, four years old, lists the addresses in order along each
street, and gives the names of the persons living at each address. For a current
interview survey of the people in the city, what are the deficiencies of this frame?
Can they be remedied by the interviewers during the course of the field work?
In using the directory, would you draw a list of addresses (dwelling-places) or a
list of persons?

1.5 In estimating by sampling the actual value of the small items in the
inventory of a large firm, the actual and the book value were recorded for each
item in the sample. For the total sample, the ratio of actual to book value was
1.021, this estimate being approximately normally distributed with a standard
error of 0.0082. If the book value of the inventory is $80,000, compute 95%
confidence limits for the actual value.

1.6 Frequently data must be treated as a sample, although at first sight they 
appear to be a complete enumeration. A proprietor of a parking lot finds that 
business is poor on Sunday mornings. After 26 Sundays in operation, his 
average receipts per Sunday morning are exactly $10. The standard error of this 
figure, computed from week-to-week variations, is $1.2. The attendant costs 
$7 each Sunday. The proprietor is willing to keep the lot open at this time if his 
expected future profit is $5 per Sunday morning. What is the confidence 
probability that the long-term profit rate will be at least $5? What assumption 
must be made in order to answer this question? 

1.7 In Table 1.2, what happens to the probability of exceeding $1\sqrt{\text{MSE}}$,
$1.96\sqrt{\text{MSE}}$, and $2.576\sqrt{\text{MSE}}$ when $B/\sigma$ tends to infinity, i.e., when the MSE
is due entirely to bias? Do your results agree with the directions of the changes
noted in Table 1.2 as $B/\sigma$ moves from 0 to 0.6?

1.8 When it is necessary to compare two estimates that have different
frequency distributions of errors $(\hat\mu - \mu)$, it is occasionally possible, in specialized
problems, to compute the cost or loss that will result from an error $(\hat\mu - \mu)$
of any given size. The estimate that gives the smaller expected loss is preferred,
other things being equal. Show that if the loss is a quadratic function $\lambda(\hat\mu - \mu)^2$
of the error, we should choose the estimate with the smaller mean square error.

CHAPTER 2


Simple Random Sampling 


2.1 SIMPLE RANDOM SAMPLING 


Sample surveys deal with samples drawn from populations which con- 
tain a finite number N of units. If these units can all be distinguished from 
one another, the number of distinct samples of size n that can be drawn 
from the N units is given by the combinatorial formula 


$$\binom{N}{n} = {}_N C_n = \frac{N!}{n!\,(N - n)!} \qquad (2.1)$$


For example, if the population contains five units denoted by A, B, C, D,
and E, there are 10 different samples of size 3, as follows:


ABC ABD ABE ACD ACE 
ADE BCD BCE BDE CDE 


Note that the same letter is not allowed to occur twice in the sample. No 
attention is paid to the order in which the letters occur in the sample, the 
six samples ABC, ACB, BAC, BCA, CAB, and CBA being considered 
identical. 
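The count in (2.1) and the list of samples can be reproduced with the Python standard library. This is only a sketch for the five-unit example; `itertools.combinations` ignores order and never repeats a unit, which matches the two conventions just stated.

```python
from itertools import combinations
from math import comb

units = ["A", "B", "C", "D", "E"]
n = 3

# Each combination is one distinct sample; order within a sample is ignored.
samples = ["".join(s) for s in combinations(units, n)]

print(len(samples), "samples; formula (2.1) gives", comb(len(units), n))
print(" ".join(samples))
```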

Simple random sampling is a method of selecting $n$ units out of the $N$ such
that every one of the ${}_N C_n$ samples has an equal chance of being chosen.
This type of sampling is sometimes called random sampling. Since the
word random is used in the literature in many different senses, an extra 


qualifying adjective is advisable. Some writers prefer the phrase unre- 
stricted random sampling. 


In practice a simple random sample is drawn unit by unit. The units in
the population are numbered from 1 to $N$. A series of random numbers
between 1 and $N$ is then drawn, either by means of a table of random numbers
or by placing the numbers 1 to $N$ in a bowl and mixing thoroughly. If
the bowl is used, $n$ numbers are drawn out in succession. The units which
bear these numbers constitute the sample. At any stage in the draw, this
process gives an equal chance of selection to all numbers not previously
drawn. It is easy to verify that all ${}_N C_n$ possible samples have an equal
chance.

When a number has been drawn from the bowl, it is not replaced, since 
this might allow the same unit to enter the sample more than once. For 
this reason the sampling is described as without replacement. Similarly, if a 
table of random numbers is employed, a number that has been drawn 
previously is ignored. Sampling with replacement is entirely feasible but 
except in special circumstances is seldom used, since there seems little point 
in having the same unit twice in the sample. 
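The unit-by-unit draw with a table of random numbers can be sketched as follows. Python's pseudo-random generator stands in for the table, which is an assumption of this sketch, and the function name `draw_simple_random_sample` is illustrative.

```python
import random

def draw_simple_random_sample(N, n, rng):
    """Draw n distinct unit numbers from 1..N one at a time,
    ignoring any number that has been drawn previously."""
    if not 0 <= n <= N:
        raise ValueError("sample size must lie between 0 and N")
    chosen = []
    seen = set()
    while len(chosen) < n:
        number = rng.randint(1, N)   # next random number between 1 and N
        if number in seen:
            continue                 # drawn before: ignore it
        seen.add(number)
        chosen.append(number)
    return chosen

rng = random.Random(2024)            # fixed seed only to make the sketch repeatable
sample = draw_simple_random_sample(N=676, n=50, rng=rng)
print(sorted(sample))
```

The standard-library call `random.sample(range(1, N + 1), n)` gives a sample of the same kind in one step; the explicit loop is shown only because it mirrors the procedure described in the text.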

Other methods of sampling are often preferable to simple random 
sampling on the grounds of convenience or of increased precision. Simple 
random sampling serves best to introduce sampling theory. 


2.2 DEFINITIONS AND NOTATION 


In a sample survey we decide on certain properties which we attempt to 
measure and record for every unit that comes into the sample. These prop- 
erties of the units are referred to as characteristics or more simply as items. 

The values obtained for any specific item in the $N$ units that comprise the
population are denoted by $y_1, y_2, \ldots, y_N$. The corresponding values for
the units in the sample are denoted by $y_1, y_2, \ldots, y_n$ or, if we wish to refer
to a typical sample member, by $y_i$ $(i = 1, 2, \ldots, n)$. Note that the sample
will not consist of the first $n$ units in the population, except in the instance,
usually rare, in which these units happen to be drawn. If this point is kept
in mind, my experience has been that no confusion need result.

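Capital letters refer to characteristics of the population and lower case
letters to those of the sample. For totals and means we have the following
definitions:

          Population                                                        Sample

Total:    $Y = \sum_{i=1}^{N} y_i = y_1 + y_2 + \cdots + y_N$                    $\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$

Mean:     $\bar Y = \dfrac{y_1 + y_2 + \cdots + y_N}{N} = \dfrac{\sum_{i=1}^{N} y_i}{N}$      $\bar y = \dfrac{y_1 + y_2 + \cdots + y_n}{n} = \dfrac{\sum_{i=1}^{n} y_i}{n}$
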
Although sampling is undertaken for many purposes, interest centers
most frequently on four characteristics of the population.

1. Mean $= \bar Y$ (e.g., the average number of children per school).
2. Total $= Y$ (e.g., the total number of acres of wheat in a region).
3. Ratio of two totals or means $R = Y/X = \bar Y/\bar X$ (e.g., ratio of liquid
assets to total assets in a group of families).
4. Proportion of units which fall into some defined class (e.g.,
proportion of people with false teeth).


Estimation of the first three quantities is discussed in this chapter.
The symbol ^ denotes an estimate of a population characteristic made
from a sample. In this chapter only the simplest estimates are considered:

                            Estimate

Population mean $\bar Y$       $\hat{\bar Y} = \bar y$ = sample mean

Population total $Y$        $\hat Y = N\bar y = N \sum_{i=1}^{n} y_i / n$

Population ratio $R$        $\hat R = \bar y / \bar x = \sum_{i=1}^{n} y_i \big/ \sum_{i=1}^{n} x_i$

In $\hat Y$ the factor $N/n$ by which the sample total is multiplied is sometimes
called the expansion or raising or inflation factor. Its inverse $n/N$, the ratio
of the size of the sample to that of the population, is called the sampling
fraction and is denoted by the letter $f$.


2.3 PROPERTIES OF THE ESTIMATES


[...] Its utility is likely to be confined [...] of consistency in a finite
population [...] (1953), [...]

As we have seen, a method of estimation is unbiased if the average value
of the estimate, taken over all possible samples [...], is equal to the true
population value. If the method is to be unbiased without
qualification, this result must hold for any population of finite values $y_i$
and for any $n$. To investigate whether $\bar y$ is unbiased with simple random
sampling, we calculate the value of $\bar y$ for all ${}_N C_n$ samples and find the
average of the estimates. The symbol $E$ denotes this average over all
possible samples.


Theorem 2.1. The sample mean $\bar y$ is an unbiased estimate of $\bar Y$.

Proof. By its definition

$$E\bar y = \frac{\sum \bar y}{{}_N C_n} = \frac{\sum (y_1 + y_2 + \cdots + y_n)}{n[N!/n!\,(N - n)!]} \qquad (2.2)$$

where the sum extends over all ${}_N C_n$ samples. To evaluate this sum, we
find out in how many samples any specific value $y_i$ appears. Since there
are $(N - 1)$ other units available for the rest of the sample and $(n - 1)$
other places to fill in the sample, the number of samples containing $y_i$ is

$${}_{N-1} C_{n-1} = \frac{(N - 1)!}{(n - 1)!\,(N - n)!}$$

Hence

$$\sum (y_1 + y_2 + \cdots + y_n) = \frac{(N - 1)!}{(n - 1)!\,(N - n)!}\,(y_1 + y_2 + \cdots + y_N) \qquad (2.3)$$

From (2.2) this gives

$$E\bar y = \frac{(N - 1)!}{(n - 1)!\,(N - n)!} \cdot \frac{n!\,(N - n)!}{n\,N!}\,(y_1 + y_2 + \cdots + y_N) = \frac{y_1 + y_2 + \cdots + y_N}{N} = \bar Y \qquad (2.4)$$

Corollary. $\hat Y = N\bar y$ is an unbiased estimate of the population total $Y$.

A less cumbersome proof of theorem 2.1 is obtained as follows. Since
every unit appears in the same number of samples, it is clear that

$$E(y_1 + y_2 + \cdots + y_n) \text{ must be some multiple of } y_1 + y_2 + \cdots + y_N \qquad (2.5)$$

The multiplier must be $n/N$, since the expression on the left has $n$ terms and
that on the right has $N$ terms. This leads to the result.


2.4 VARIANCES OF THE ESTIMATES


The variance of the $y_i$ in a finite population is usually defined as

$$\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \bar Y)^2}{N} \qquad (2.6)$$

As a matter of notation, results are presented in terms of a slightly
different expression, in which the divisor $(N - 1)$ is used instead of $N$. We
take

$$S^2 = \frac{\sum_{i=1}^{N} (y_i - \bar Y)^2}{N - 1} \qquad (2.7)$$


This convention has been used by those who approach sampling theory by
means of the analysis of variance. Its advantage is that most results take a
slightly simpler form. Provided that the same notation is maintained con- 
sistently, all results are equivalent in either notation. 


We now consider the variance of $\bar y$. By this we mean $E(\bar y - \bar Y)^2$ taken
over all ${}_N C_n$ samples.


Theorem 2.2. The variance of the mean $\bar y$ from a simple random sample
is

$$V(\bar y) = E(\bar y - \bar Y)^2 = \frac{S^2}{n}\,\frac{(N - n)}{N} = \frac{S^2}{n}(1 - f) \qquad (2.8)$$

where $f = n/N$ is the sampling fraction.
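Proof.

$$n(\bar y - \bar Y) = (y_1 - \bar Y) + (y_2 - \bar Y) + \cdots + (y_n - \bar Y) \qquad (2.9)$$

By the argument of symmetry used in relation (2.5), it follows that

$$E[(y_1 - \bar Y)^2 + \cdots + (y_n - \bar Y)^2] = \frac{n}{N}[(y_1 - \bar Y)^2 + \cdots + (y_N - \bar Y)^2] \qquad (2.10)$$

and also that

$$E[(y_1 - \bar Y)(y_2 - \bar Y) + (y_1 - \bar Y)(y_3 - \bar Y) + \cdots + (y_{n-1} - \bar Y)(y_n - \bar Y)]$$
$$= \frac{n(n - 1)}{N(N - 1)}[(y_1 - \bar Y)(y_2 - \bar Y) + (y_1 - \bar Y)(y_3 - \bar Y) + \cdots + (y_{N-1} - \bar Y)(y_N - \bar Y)] \qquad (2.11)$$

In (2.11) the sums of products extend over all pairs of units in the sample
and population, respectively. The sum on the left contains $n(n - 1)/2$ terms
and that on the right $N(N - 1)/2$ terms.

Now square (2.9) and average over all simple random samples. Using
(2.10) and (2.11), we obtain

$$n^2 E(\bar y - \bar Y)^2 = \frac{n}{N}\left\{(y_1 - \bar Y)^2 + \cdots + (y_N - \bar Y)^2 + \frac{2(n - 1)}{N - 1}[(y_1 - \bar Y)(y_2 - \bar Y) + \cdots + (y_{N-1} - \bar Y)(y_N - \bar Y)]\right\}$$

Completing the square on the cross-product term, we have

$$n^2 E(\bar y - \bar Y)^2 = \frac{n}{N}\left\{\left(1 - \frac{n - 1}{N - 1}\right)[(y_1 - \bar Y)^2 + \cdots + (y_N - \bar Y)^2] + \frac{n - 1}{N - 1}[(y_1 - \bar Y) + \cdots + (y_N - \bar Y)]^2\right\}$$

The second term inside the curly bracket vanishes, since the sum of the
$y_i$ equals $N\bar Y$. Division by $n^2$ gives

$$V(\bar y) = E(\bar y - \bar Y)^2 = \frac{N - n}{nN(N - 1)} \sum_{i=1}^{N} (y_i - \bar Y)^2 = \frac{S^2}{n}\,\frac{(N - n)}{N}$$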
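Corollary 1. The standard error of $\bar y$ is

$$\sigma_{\bar y} = \frac{S}{\sqrt{n}}\sqrt{(N - n)/N} = \frac{S}{\sqrt{n}}\sqrt{1 - f} \qquad (2.12)$$

Corollary 2. The variance of $\hat Y = N\bar y$, as an estimate of the population
total $Y$, is

$$V(\hat Y) = E(\hat Y - Y)^2 = \frac{N^2 S^2}{n}\,\frac{(N - n)}{N} = \frac{N^2 S^2}{n}(1 - f) \qquad (2.13)$$

Corollary 3. The standard error of $\hat Y$ is

$$\sigma_{\hat Y} = \frac{NS}{\sqrt{n}}\sqrt{(N - n)/N} = \frac{NS}{\sqrt{n}}\sqrt{1 - f} \qquad (2.14)$$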
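As a numerical check of theorem 2.2 and corollary 2, the variance of the sample mean can be computed directly over all possible samples and compared with $S^2(1 - f)/n$. The population values are hypothetical and exact fractions avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

y = [3, 7, 8, 12, 20]          # hypothetical population values
N, n = len(y), 3

Y_bar = Fraction(sum(y), N)
S2 = sum((v - Y_bar) ** 2 for v in y) / (N - 1)     # divisor N - 1, as in (2.7)
f = Fraction(n, N)                                   # sampling fraction

means = [Fraction(sum(s), n) for s in combinations(y, n)]
var_mean_enumerated = sum((m - Y_bar) ** 2 for m in means) / len(means)
var_mean_formula = S2 / n * (1 - f)                  # theorem 2.2

totals = [N * m for m in means]                      # estimates of the total
var_total_enumerated = sum((t - sum(y)) ** 2 for t in totals) / len(totals)
var_total_formula = N ** 2 * S2 / n * (1 - f)        # corollary 2

print(var_mean_enumerated, var_mean_formula)
print(var_total_enumerated, var_total_formula)
```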


2.5 THE FINITE POPULATION CORRECTION 


For a random sample of size n from an infinite population, it is well 
known that the variance of the mean is o?/n. The only change in this result 
when the population is finite is the introduction of the factor (N — n)/N. 
The factors $(N - n)/N$ for the variance and $\sqrt{(N - n)/N}$ for the standard
error are called the finite population corrections (fpc). They are given
with a divisor $(N - 1)$ in place of $N$ by writers who present results in
terms of $\sigma$. Provided that the sampling fraction $n/N$ remains low, these
factors are close to unity, and the size of the population as such has no
direct effect on the standard error of the sample mean. For instance, if $S$
is the same in the two populations, a sample of 500 from a population of
200,000 gives almost as precise an estimate of the population mean as a
sample of 500 from a population of 10,000. Persons unfamiliar with
sampling often find this result very difficult to believe, and indeed it is
remarkable. To them it seems intuitively obvious that, if information has
been obtained about only a very small fraction of the population, the sample
mean just cannot be accurate. It is instructive for the reader to consider
why this point of view is erroneous.

In practice the fpc can be ignored whenever the sampling fraction does
not exceed 5% and for many purposes even if it is as high as 10%. The
effect of ignoring the correction is to overestimate the standard error of the
estimate $\bar y$.
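The comparison of the two populations can be made concrete with a short sketch of (2.12). The value `S = 10.0` is a hypothetical common standard deviation; only the ratios of the results matter.

```python
import math

def standard_error_of_mean(S, n, N):
    """Standard error of the sample mean, including the fpc: (S/sqrt(n)) * sqrt((N - n)/N)."""
    return S / math.sqrt(n) * math.sqrt((N - n) / N)

S = 10.0                                      # hypothetical, same in both populations
se_large = standard_error_of_mean(S, 500, 200_000)
se_small = standard_error_of_mean(S, 500, 10_000)
se_no_fpc = S / math.sqrt(500)                # fpc ignored

print(f"N = 200,000: {se_large:.4f}")
print(f"N =  10,000: {se_small:.4f}")
print(f"fpc ignored: {se_no_fpc:.4f}")
```

With a sampling fraction of 5% in the smaller population, the standard error changes by only about 2.5%, and ignoring the fpc errs on the side of overstating the standard error.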

The following theorem, which is an extension of theorem 2.2, is not
required for the discussion in this chapter, but it is proved here for later
reference.


Theorem 2.3. If $y_i$, $x_i$ are a pair of variates defined on every unit in
the population and $\bar y$, $\bar x$ are the corresponding means from a simple
random sample of size $n$, then their covariance

$$E(\bar y - \bar Y)(\bar x - \bar X) = \frac{N - n}{nN}\,\frac{1}{N - 1} \sum_{i=1}^{N} (y_i - \bar Y)(x_i - \bar X) \qquad (2.15)$$

This theorem reduces to theorem 2.2 if the variates $y_i$, $x_i$ are equal on
every unit.

Proof. Apply theorem 2.2 to the variate $u_i = y_i + x_i$. The population
mean of $u_i$ is $\bar U = \bar Y + \bar X$, and theorem 2.2 gives

$$E(\bar u - \bar U)^2 = \frac{N - n}{nN}\,\frac{1}{N - 1} \sum_{i=1}^{N} (u_i - \bar U)^2$$

that is,

$$E[(\bar y - \bar Y) + (\bar x - \bar X)]^2 = \frac{N - n}{nN}\,\frac{1}{N - 1} \sum_{i=1}^{N} [(y_i - \bar Y) + (x_i - \bar X)]^2 \qquad (2.16)$$

Expand the quadratic terms on both sides. By theorem 2.2,

$$E(\bar y - \bar Y)^2 = \frac{N - n}{nN}\,\frac{1}{N - 1} \sum_{i=1}^{N} (y_i - \bar Y)^2$$

with a similar relation for $E(\bar x - \bar X)^2$. Hence these two terms cancel on the
left and right sides of (2.16). The result of the theorem (equation 2.15)
follows from the cross-product terms.


2.6 ESTIMATION OF THE STANDARD ERROR 
FROM A SAMPLE 


The formulas for the standard errors of the estimated population mean
and total are used primarily for three purposes: (1) to compare the
precision obtained by simple random sampling with that given by other
methods of sampling, (2) to estimate the size of the sample needed in a
survey that is being planned, and (3) to estimate the precision actually attained
in a survey that has been completed. The formulas involve $S^2$, the
population variance. In practice this will not be known, but it can be estimated
from the sample data. The relevant result is stated in theorem 2.4.


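Theorem 2.4. For a simple random sample

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar y)^2}{n - 1}$$

is an unbiased estimate of

$$S^2 = \frac{\sum_{i=1}^{N} (y_i - \bar Y)^2}{N - 1}$$

Proof. We may write

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} [(y_i - \bar Y) - (\bar y - \bar Y)]^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} (y_i - \bar Y)^2 - n(\bar y - \bar Y)^2\right]$$

Now average over all simple random samples of size $n$. By the argument
of symmetry used in theorem 2.2,

$$E\left[\sum_{i=1}^{n} (y_i - \bar Y)^2\right] = \frac{n}{N} \sum_{i=1}^{N} (y_i - \bar Y)^2 = \frac{n(N - 1)}{N} S^2$$

by the definition of $S^2$. Further, by theorem 2.2,

$$E[n(\bar y - \bar Y)^2] = \frac{N - n}{N} S^2$$

Hence

$$E(s^2) = \frac{S^2}{(n - 1)N}[n(N - 1) - (N - n)] = S^2 \qquad (2.17)$$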
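Theorem 2.4 can also be verified by enumeration: averaging $s^2$ over every possible simple random sample should return $S^2$ exactly. The population values below are hypothetical, and exact fractions are used.

```python
from fractions import Fraction
from itertools import combinations

y = [3, 7, 8, 12, 20]          # hypothetical population values
N, n = len(y), 3

Y_bar = Fraction(sum(y), N)
S2 = sum((v - Y_bar) ** 2 for v in y) / (N - 1)

def sample_variance(sample):
    """s^2 with divisor (n - 1)."""
    m = Fraction(sum(sample), len(sample))
    return sum((v - m) ** 2 for v in sample) / (len(sample) - 1)

all_s2 = [sample_variance(s) for s in combinations(y, n)]
mean_s2 = sum(all_s2) / len(all_s2)    # E(s^2) over all simple random samples

print("E(s^2) =", mean_s2, " S^2 =", S2)
```

Had $S^2$ been defined with the divisor $N$, the two printed values would differ, which is one practical reason for the convention adopted in (2.7).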
Corollary. Unbiased estimates of the variances of $\bar y$ and $\hat Y = N\bar y$ are

$$v(\bar y) = s_{\bar y}^2 = \frac{s^2}{n}\left(\frac{N - n}{N}\right) = \frac{s^2}{n}(1 - f) \qquad (2.18)$$

$$v(\hat Y) = s_{\hat Y}^2 = \frac{N^2 s^2}{n}\left(\frac{N - n}{N}\right) = \frac{N^2 s^2}{n}(1 - f) \qquad (2.19)$$

For the standard errors we take

$$s_{\bar y} = \frac{s}{\sqrt{n}}\sqrt{1 - f}, \qquad s_{\hat Y} = \frac{Ns}{\sqrt{n}}\sqrt{1 - f} \qquad (2.20)$$

These estimates are slightly biased: for most applications the bias is
unimportant.

The reader should note the symbols employed for true and estimated
variances of the estimates. Thus for $\bar y$ we write

true variance: $V(\bar y) = \sigma_{\bar y}^2$
estimated variance: $v(\bar y) = s_{\bar y}^2$.

2.7 CONFIDENCE LIMITS


It is usually assumed that the estimates $\bar y$ and $\hat Y$ are normally distributed
about the corresponding population values. The reasons for this
assumption and its limitations are considered in section 2.13. If the assumption
holds, lower and upper confidence limits for the population mean and
total are as follows:

Mean:

$$\hat{\bar Y}_L = \bar y - \frac{ts}{\sqrt{n}}\sqrt{1 - f}, \qquad \hat{\bar Y}_U = \bar y + \frac{ts}{\sqrt{n}}\sqrt{1 - f} \qquad (2.21)$$

Total:

$$\hat Y_L = N\bar y - \frac{tNs}{\sqrt{n}}\sqrt{1 - f}, \qquad \hat Y_U = N\bar y + \frac{tNs}{\sqrt{n}}\sqrt{1 - f} \qquad (2.22)$$

The symbol $t$ is the value of the normal deviate corresponding to the
desired confidence probability. The most common values are

Confidence probability (%)    50      80      90      95      99
$t$                           0.67    1.28    1.64    1.96    2.58
If the sample size is less than 60, the percentage points may be taken from 
Student's t table with (n — 1) degrees of freedom, these being the degrees 
of freedom in the estimated variance s?. The t distribution holds exactly 
only if the observations y; are themselves normally distributed and N is 
infinite. Moderate departures from normality do not affect it greatly. 


For small samples with very skew distributions, special methods are 
needed. 


Example. Signatures to a petition were collected on 676 sheets. Each sheet
had enough space for 42 signatures, but on many sheets a smaller number of
signatures had been collected. The numbers of signatures per sheet were counted
on a random sample of 50 sheets (about a 7% sample), with the results shown in
Table 2.1.

Estimate the total number of signatures to the petition and the 80 per cent
confidence limits.

The sampling unit is a sheet, and the observations $y_i$ are the numbers of
signatures per sheet. Since about half the sheets had the maximum number of
signatures, 42, the data are presented as a frequency distribution. Note that
the original distribution appears to be far from normal, the greatest frequency
being at the upper end. Nevertheless, there is reason to believe from experience
that the means of samples of 50 are approximately normally distributed.
We find

$$n = \sum f_i = 50, \qquad y = \sum f_i y_i = 1471, \qquad \sum f_i y_i^2 = 54{,}497$$

Hence the estimated total number of signatures is

$$\hat Y = N\bar y = \frac{(676)(1471)}{50} = 19{,}888$$

For the sample variance $s^2$ we have

$$s^2 = \frac{1}{n - 1}\left[\sum f_i (y_i - \bar y)^2\right] = \frac{1}{n - 1}\left[\sum f_i y_i^2 - \frac{(\sum f_i y_i)^2}{\sum f_i}\right]$$

$$= \frac{1}{49}\left[54{,}497 - \frac{(1471)^2}{50}\right] = 229.0$$

From (2.22) the 80% confidence limits are

$$19{,}888 \pm \frac{tNs}{\sqrt{n}}\sqrt{1 - f} = 19{,}888 \pm \frac{(1.28)(676)(15.13)\sqrt{1 - 0.0740}}{\sqrt{50}}$$

This gives 18,107 and 21,669 for the 80% limits. A complete count showed
21,045 signatures.


TABLE 2.1
RESULTS FOR A SAMPLE OF 50 PETITION SHEETS

Number of Signatures    Frequency
       $y_i$               $f_i$

        42                  23
        41                   4
        36                   1
        32                   1
        29                   1
        27                   2
        23                   1
        19                   1
        16                   2
        15                   2
        14                   1
        11                   1
        10                   1
         9                   1
         7                   1
         6                   3
         5                   2
         4                   1
         3                   1
                            --
                            50
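The petition example can be reworked from the frequency table as a sketch of formulas (2.19) to (2.22). The variable names are illustrative, and the frequency for sheets with 5 signatures is taken as 2, the value that reproduces the totals $n = 50$ and $\sum f_i y_i = 1471$ stated in the example.

```python
import math

# (signatures per sheet y_i, frequency f_i) for the sample of 50 sheets
table = [(42, 23), (41, 4), (36, 1), (32, 1), (29, 1), (27, 2), (23, 1),
         (19, 1), (16, 2), (15, 2), (14, 1), (11, 1), (10, 1), (9, 1),
         (7, 1), (6, 3), (5, 2), (4, 1), (3, 1)]

N = 676                                   # sheets in the population
t = 1.28                                  # normal deviate for 80% confidence

n = sum(f for _, f in table)
sum_y = sum(f * y for y, f in table)
sum_y2 = sum(f * y * y for y, f in table)

y_bar = sum_y / n
Y_hat = N * y_bar                         # estimated total number of signatures
s2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)  # sample variance
fraction = n / N                          # sampling fraction f

half_width = t * N * math.sqrt(s2) / math.sqrt(n) * math.sqrt(1 - fraction)
lower, upper = Y_hat - half_width, Y_hat + half_width

print(f"n = {n}, sum fy = {sum_y}, sum fy^2 = {sum_y2}")
print(f"estimated total = {Y_hat:.0f}, s^2 = {s2:.1f}")
print(f"80% limits: {lower:.0f} to {upper:.0f}")
```

Because the sketch carries $s$ without rounding it to 15.13, the limits may differ from the quoted 18,107 and 21,669 by a unit or so.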


2.8 AN ALTERNATIVE METHOD OF PROOF 


Cornfield (1944) suggested a method of proving the principal results for
simple random sampling without replacement that enables us to use
standard results from infinite population theory. Let $a_i$ be a random
variate which takes the value 1 if the $i$th unit is in the sample and the value
0 otherwise. The sample mean $\bar y$ may be written

$$\bar y = \frac{1}{n} \sum_{i=1}^{N} a_i y_i \qquad (2.23)$$

where the sum extends over all $N$ units in the population. In this expression
the $a_i$ are random variables and the $y_i$ are a set of fixed numbers.
Clearly

$$\Pr(a_i = 1) = \frac{n}{N}, \qquad \Pr(a_i = 0) = 1 - \frac{n}{N}$$

Thus $a_i$ is distributed as a binomial variate in a single trial, with
$P = n/N$. Hence

$$E(a_i) = P = \frac{n}{N}, \qquad V(a_i) = PQ = \frac{n}{N}\left(1 - \frac{n}{N}\right) \qquad (2.24)$$

To find $V(\bar y)$ we need also the covariance of $a_i$ and $a_j$. The product
$a_i a_j$ is 1 if the $i$th and $j$th unit are both in the sample and is zero otherwise.
The probability that two specific units are both in the sample is easily found
to be $n(n - 1)/N(N - 1)$. Hence

$$\text{Cov}(a_i a_j) = E(a_i a_j) - E(a_i)E(a_j)$$

$$= \frac{n(n - 1)}{N(N - 1)} - \left(\frac{n}{N}\right)^2 = -\frac{n}{N(N - 1)}\left(1 - \frac{n}{N}\right) \qquad (2.25)$$

Applying this approach to find $V(\bar y)$, we have from (2.23),

$$V(\bar y) = \frac{1}{n^2}\left[\sum_{i=1}^{N} y_i^2 V(a_i) + 2 \sum_{i<j} y_i y_j \,\text{Cov}(a_i a_j)\right]$$

$$= \frac{1 - f}{nN}\left(\sum_{i=1}^{N} y_i^2 - \frac{2}{N - 1} \sum_{i<j} y_i y_j\right)$$

using (2.24) and (2.25). Completing the square on the cross-product term
gives

$$V(\bar y) = \frac{1 - f}{nN}\left[\frac{N}{N - 1} \sum_{i=1}^{N} y_i^2 - \frac{1}{N - 1}\left(\sum_{i=1}^{N} y_i\right)^2\right]$$

$$= \frac{1 - f}{n(N - 1)} \sum_{i=1}^{N} (y_i - \bar Y)^2 = \frac{(1 - f)S^2}{n}$$

The method [...] find higher [...]

A similar approach applies when sampling is with replacement. In this
event the $i$th unit may appear $0, 1, 2, \ldots, n$ times in the sample. Let $t_i$ be
the number of times that the $i$th unit appears in the sample. Then

$$\bar y = \frac{1}{n} \sum_{i=1}^{N} t_i y_i \qquad (2.26)$$

Since the probability that the $i$th unit is drawn is $1/N$ at each draw, the
variate $t_i$ is distributed as a binomial number of successes out of $n$ trials
with $p = 1/N$. Hence

$$E(t_i) = \frac{n}{N}, \qquad V(t_i) = n\left(\frac{1}{N}\right)\left(1 - \frac{1}{N}\right) \qquad (2.27)$$

Jointly, the variates $t_i$ follow a multinomial distribution. For this,

$$\text{Cov}(t_i t_j) = -\frac{n}{N^2} \qquad (2.28)$$

Using (2.26), (2.27), and (2.28), we have, for sampling with replacement,

$$V(\bar y) = \frac{1}{n^2}\left[\sum_{i=1}^{N} y_i^2 \,\frac{n(N - 1)}{N^2} - 2 \sum_{i<j} y_i y_j \,\frac{n}{N^2}\right] = \frac{1}{nN} \sum_{i=1}^{N} (y_i - \bar Y)^2 = \frac{\sigma^2}{n} = \frac{N - 1}{N}\,\frac{S^2}{n}$$

Frequently the quantity that is to be estimated from a simple random
sample is the ratio of two variables both of which vary from unit to unit. 
In a household survey examples are the average number of suits of clothes 
per adult male, the average expenditure on cosmetics per adult female, and 
the average number of hours per week spent watching television per child 
aged 10 to 15. In order to estimate the first of these items, we would
record for the ith household (i = 1, 2, ..., n) the number of adult males
$x_i$ who live there and the total number of suits $y_i$ that they possess. The
population parameter to be estimated is the ratio

\[ R = \frac{\text{total number of suits}}{\text{total number of adult males}} = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i} \]

The corresponding sample estimate is

\[ \hat{R} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i} = \frac{\bar{y}}{\bar{x}} \]
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Examples of this kind occur frequently when the sampling unit (the 
household) comprises a group or cluster of elements (adult males) and our 
interest is in the population mean per element. Ratios also appear in many 
other applications, for example, the ratio of loans for building purposes to 
total loans in a bank or the ratio of acres of wheat to total acres on a 
farm. 

The sampling distribution of $\hat{R}$ is more complicated than that of $\bar{y}$
because both the numerator $\bar{y}$ and the denominator $\bar{x}$ vary from sample
to sample. In small samples the distribution of $\hat{R}$ is skew and $\hat{R}$ is usually
a slightly biased estimate of R. In large samples the distribution of $\hat{R}$
tends to normality and the bias becomes negligible. The following approximate
result will serve for most purposes: the distribution of $\hat{R}$ is studied in
more detail in Chapter 6.

Theorem 2.5. If variates $y_i$, $x_i$ are measured on each unit of a simple
random sample of size n, assumed large, the variance of $\hat{R} = \bar{y}/\bar{x}$ is
approximately

\[ V(\hat{R}) \doteq \frac{1-f}{n\bar{X}^2}\,\frac{\sum_{i=1}^{N}(y_i - Rx_i)^2}{N-1} \tag{2.29} \]

where $R = \bar{Y}/\bar{X}$ is the ratio of the population means and $f = n/N$.

Proof.

\[ \hat{R} - R = \frac{\bar{y}}{\bar{x}} - R = \frac{\bar{y} - R\bar{x}}{\bar{x}} \tag{2.30} \]

If n is large, $\bar{x}$ should not differ greatly from $\bar{X}$. The approximation
consists in replacing $\bar{x}$ by $\bar{X}$ in the denominator of (2.30). This gives

\[ \hat{R} - R \doteq \frac{\bar{y} - R\bar{x}}{\bar{X}} \tag{2.31} \]

Now average over all simple random samples of size n.

\[ E(\hat{R} - R) \doteq \frac{E(\bar{y} - R\bar{x})}{\bar{X}} = \frac{\bar{Y} - R\bar{X}}{\bar{X}} = 0 \tag{2.32} \]

since $R = \bar{Y}/\bar{X}$. This shows that to the order of approximation used here
$\hat{R}$ is an unbiased estimate of R.

From (2.31) we also obtain the result

\[ V(\hat{R}) = E(\hat{R} - R)^2 \doteq \frac{1}{\bar{X}^2}E(\bar{y} - R\bar{x})^2 \]

The quantity $\bar{y} - R\bar{x}$ is the sample mean of the variate $d_i = y_i - Rx_i$,
whose population mean $\bar{D} = \bar{Y} - R\bar{X} = 0$. Hence we can find $V(\hat{R})$ by
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applying theorem 2.2 for the variance of the mean of a simple random 
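sample to the variate $d_i$ and dividing by $\bar{X}^2$. This gives

\[ V(\hat{R}) \doteq \frac{1}{\bar{X}^2}E(\bar{y} - R\bar{x})^2 = \frac{1-f}{n\bar{X}^2}\,\frac{\sum_{i=1}^{N}(y_i - Rx_i)^2}{N-1} \]

This completes the proof.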

The way in which theorem 2.5 was proved is worth noting. It was
shown that the formula in theorem 2.2 for the variance of the sample
mean $\bar{y}$ gives the formula for the approximate variance of the ratio $\bar{y}/\bar{x}$, if
the variate $y_i$ is replaced by the variate $(y_i - Rx_i)/\bar{X}$. The same result, or
its natural extension, holds also in more complex sampling situations and
is used frequently later in this book.

As a sample estimate of

\[ \frac{\sum_{i=1}^{N}(y_i - Rx_i)^2}{N-1} \]

it is customary to take

\[ \frac{\sum_{i=1}^{n}(y_i - \hat{R}x_i)^2}{n-1} \]

This estimate can be shown to have a bias of order 1/n.
For the estimated standard error of $\hat{R}$, this gives

\[ s(\hat{R}) = \frac{\sqrt{1-f}}{\sqrt{n}\,\bar{X}}\sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{R}x_i)^2}{n-1}} \tag{2.33} \]

If $\bar{X}$ is not known, the sample estimate $\bar{x}$ is substituted in the denominator.
The quickest way to compute $s(\hat{R})$ on a desk machine is to express it as

\[ s(\hat{R}) = \frac{\sqrt{1-f}}{\sqrt{n}\,\bar{X}}\sqrt{\frac{\sum y_i^2 - 2\hat{R}\sum y_i x_i + \hat{R}^2\sum x_i^2}{n-1}} \tag{2.34} \]


Example. Table 2.2 shows the number of persons ($x_1$), the weekly family
income ($x_2$), and the weekly expenditure on food (y) in a simple random sample
of 33 low-income families. Since the sample is small, the data are intended only
to illustrate the calculations.

Estimate from the sample (a) the mean weekly expenditure on food per family,
(b) the mean weekly expenditure on food per person, (c) the percentage of the
income that is spent on food. Compute the standard errors of these estimates.
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Weekly Expenditure on Food per Family. This is the ordinary sample mean 


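\[ \bar{y} = \frac{907.2}{33} = \$27.49 \]

By theorem 2.2 (ignoring the fpc) its standard error is

\[ s_{\bar{y}} = \frac{s}{\sqrt{n}} = \frac{1}{\sqrt{n(n-1)}}\sqrt{\sum y_i^2 - \frac{(\sum y_i)^2}{n}} = \frac{1}{\sqrt{(33)(32)}}\sqrt{28{,}224 - \frac{(907.2)^2}{33}} = \$1.76 \]

(The uncorrected sum of squares 28,224 is given underneath Table 2.2.)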


TABLE 2.2
SIZE, WEEKLY INCOME, AND FOOD COST OF 33 FAMILIES

Family   Size   Income   Food Cost      Family   Size   Income   Food Cost
Number   x_1    x_2      y              Number   x_1    x_2      y

   1      2      62      14.3             18      4      83      36.0
   2      3      62      20.8             19      2      85      20.6
   3      3      87      22.7             20      4      73      27.7
   4      5      65      30.5             21      2      66      25.9
   5      4      58      41.2             22      5      58      23.3
   6      7      92      28.2             23      3      77      39.8
   7      2      88      24.2             24      4      69      16.8
   8      4      79      30.0             25      7      65      37.8
   9      2      83      24.2             26      3      77      34.8
  10      5      62      44.4             27      3      69      28.7
  11      3      63      13.4             28      6      95      63.0
  12      6      62      19.8             29      2      77      19.5
  13      4      60      29.4             30      2      69      21.6
  14      4      75      27.1             31      6      69      18.2
  15      2      90      22.2             32      4      67      20.1
  16      5      75      37.7             33      2      63      20.7
  17      3      69      22.6
                                        Total    123    2394    907.2

$\sum x_1^2 = 533$, $\sum x_2^2 = 177{,}254$, $\sum y^2 = 28{,}224$,
$\sum x_1 y = 3595.5$, $\sum x_2 y = 66{,}678$
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The sums of squares and products needed to compute S(R) by (2.34) are found 
under Table 2.2. We need in addition

\[ 2\hat{R}_1 = 14.7512, \qquad \hat{R}_1^2 = 54.3996, \qquad \bar{x}_1 = 3.7273 \]

Extra decimals are carried in $\hat{R}_1$, $2\hat{R}_1$, $\hat{R}_1^2$ to preserve accuracy.
Hence from (2.34)

\[ s(\hat{R}_1) = \frac{1}{\sqrt{33}\,(3.7273)}\sqrt{\frac{(28{,}224) - (14.7512)(3595.5) + (54.3996)(533)}{32}} = \$0.534 \]

Percentage of Income Spent on Food. This again is a ratio of two variables

\[ \hat{R}_2 = 100\,\frac{\sum y}{\sum x_2} = \frac{(100)(907.2)}{2394} = 37.9\% \]

By (2.34) the reader may verify that the standard error is 2.38%.
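The three standard errors in this example can be reproduced from the totals, sums of squares, and sums of products quoted under Table 2.2. The sketch below assumes the desk-machine form (2.34) with the sample mean of x in the denominator and with the fpc ignored (f = 0), as in the example; the function name `ratio_and_se` is introduced here only for illustration.

```python
import math

# Totals, sums of squares, and sums of products quoted with Table 2.2.
n = 33
sum_y, sum_y2 = 907.2, 28224.0                          # food cost
sum_x1, sum_x1sq, sum_x1y = 123.0, 533.0, 3595.5        # family size
sum_x2, sum_x2sq, sum_x2y = 2394.0, 177254.0, 66678.0   # income

def ratio_and_se(sum_y, sum_y2, sum_x, sum_xsq, sum_xy, n, f=0.0):
    """Ratio estimate and its estimated standard error, in the form (2.34).

    The sample mean of x replaces the population mean in the denominator,
    and f is the sampling fraction (0 when the fpc is ignored).
    """
    r = sum_y / sum_x
    xbar = sum_x / n
    ss = sum_y2 - 2.0 * r * sum_xy + r * r * sum_xsq
    se = math.sqrt(1.0 - f) / (math.sqrt(n) * xbar) * math.sqrt(ss / (n - 1))
    return r, se

# (a) Mean food cost per family and its standard error (theorem 2.2, no fpc).
mean_y = sum_y / n
se_mean = math.sqrt((sum_y2 - sum_y ** 2 / n) / (n * (n - 1)))

# (b) Food cost per person; (c) proportion of income spent on food.
r1, se_r1 = ratio_and_se(sum_y, sum_y2, sum_x1, sum_x1sq, sum_x1y, n)
r2, se_r2 = ratio_and_se(sum_y, sum_y2, sum_x2, sum_x2sq, sum_x2y, n)

print(f"per family: {mean_y:.2f} (s.e. {se_mean:.2f})")
print(f"per person: {r1:.2f} (s.e. {se_r1:.3f})")
print(f"percent of income: {100 * r2:.1f}% (s.e. {100 * se_r2:.2f}%)")
```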


2.10 ESTIMATES OF MEANS OVER SUBPOPULATIONS 


In many surveys, estimates are made for each of a number of classes into 
which the population is subdivided. In a household survey separate esti- 
mates might be wanted for families with 0, 1, 2,... children, for owners 
and renters, or for families in different occupation groups. The term 
domains of study has been given to these subpopulations by the U.N. 
Subcommission on Sampling (1950). 

In the simplest situation each unit in the population falls into one of the 
domains. Let the jth domain contain N; units, and let n, be the number of 
units in a simple random sample of size n that happen to fall in this domain. 


If $y_{jk}$ (k = 1, 2, ..., $n_j$) are the measurements on these units, the
population mean $\bar{Y}_j$ for the jth domain is estimated by

\[ \bar{y}_j = \sum_{k=1}^{n_j} \frac{y_{jk}}{n_j} \tag{2.35} \]

At first sight $\bar{y}_j$ seems to be a ratio estimate as in section 2.9, for,
although n is fixed, $n_j$ will vary from one sample of size n to another. The
complication of a ratio estimate can be avoided by considering the distribution
of $\bar{y}_j$ over samples in which both n and $n_j$ are fixed.
In the totality of samples with given n and $n_j$ the probability that any
specific set of $n_j$ units from the $N_j$ units in domain j is drawn is

\[ \frac{\binom{N-N_j}{n-n_j}}{\binom{N-N_j}{n-n_j}\binom{N_j}{n_j}} = \frac{1}{\binom{N_j}{n_j}} \]

Since each specific set of $n_j$ units from domain j can appear with all
selections of $(n - n_j)$ units from the $(N - N_j)$ that are not in domain j, the
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numerator above is the number of samples containing a specified set of n,, 
and the denominator is the total number of samples. It follows that theorems
2.1, 2.2, and 2.4 apply to the $y_{jk}$, if we put $n_j$ for n and $N_j$ for N.

From theorem 2.1: $\bar{y}_j$ is an unbiased estimate of $\bar{Y}_j$ (2.36)

From theorem 2.2: the standard error of $\bar{y}_j$ is

\[ \frac{S_j}{\sqrt{n_j}}\sqrt{1 - (n_j/N_j)} \tag{2.37} \]

where

\[ S_j^2 = \sum_{k=1}^{N_j}\frac{(y_{jk} - \bar{Y}_j)^2}{N_j - 1} \tag{2.38} \]

From theorem 2.4: An estimate of the standard error of $\bar{y}_j$ is

\[ \frac{s_j}{\sqrt{n_j}}\sqrt{1 - (n_j/N_j)} \tag{2.39} \]

where

\[ s_j^2 = \sum_{k=1}^{n_j}\frac{(y_{jk} - \bar{y}_j)^2}{n_j - 1} \tag{2.40} \]

If the value of $N_j$ is not known, the quantity n/N may be used in place of
$n_j/N_j$ when computing the fpc. (With simple random sampling, $n_j/N_j$ is an
unbiased estimate of n/N.)

2.11 ESTIMATES OF TOTALS OVER SUBPOPULATIONS 


In a firm's list of accounts receivable, in which some accounts have been
paid and some not, we might wish to estimate by a sample the total dollar
amount of unpaid bills. If $N_j$ (the number of unpaid bills in the population)
is known, there is no problem. The sample estimate is $N_j\bar{y}_j$ and its
conditional standard error is $N_j$ times expression (2.37).

Alternatively, if the total amount receivable in the list is known, a ratio
estimate can be used. The sample gives an estimate of the ratio (total
amount of unpaid bills)/(total amount of all bills). This is multiplied by
the total amount receivable in the list.


repeated samples of size n. 
not help in this problem, 
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In presenting the proof we revert to the original notation, in which y; is ; 
the measurement on the ith unit in the population. Define for every unit in
the population a new variate $y_i'$, where

\[ y_i' = \begin{cases} y_i & \text{if the unit is in the jth domain,} \\ 0 & \text{otherwise} \end{cases} \]

The population total of the $y_i'$ is

\[ \sum_{i=1}^{N} y_i' = \sum_{\text{jth dom}} y_i = Y_j \]

In a simple random sample of size n, $y_i' = y_i$ for each of the $n_j$ units that
lie in the jth domain; $y_i' = 0$ for each of the remaining $n - n_j$ units. If
$\bar{y}'$ is the ordinary sample mean of the $y_i'$, the quantity

\[ N\bar{y}' = \frac{N}{n}\sum_{i=1}^{n} y_i' = \frac{N}{n}\sum_{k=1}^{n_j} y_{jk} = \hat{Y}_j \]

This result shows that the estimate $\hat{Y}_j$ as defined in equation (2.41) is N
times the sample mean of the $y_i'$.

In repeated samples of size n we can clearly apply theorems 2.1, 2.2, and
2.4 to the variates $y_i'$. These show that $\hat{Y}_j$ is an unbiased estimate of $Y_j$
with standard error

\[ \sigma(\hat{Y}_j) = \frac{NS'}{\sqrt{n}}\sqrt{1 - (n/N)} \tag{2.42} \]

where S' is the population standard deviation of the $y_i'$. In order to compute
S', we regard the population as consisting of the $N_j$ values $y_i$ that are
in the jth domain and of $N - N_j$ zero values. Thus

\[ S'^2 = \frac{1}{N-1}\left(\sum_{\text{jth dom}} y_i^2 - \frac{Y_j^2}{N}\right) \tag{2.43} \]

From theorem 2.4 a sample estimate of the standard error of $\hat{Y}_j$ is

\[ s(\hat{Y}_j) = \frac{Ns'}{\sqrt{n}}\sqrt{1 - (n/N)} \tag{2.44} \]

In computing s', any unit not in the jth domain is given a zero value. Some
students seem to have a psychological objection to doing this, but the
method is sound.

The methods of this and the preceding section also apply to surveys in
which the frame used contains units that do not belong to the population
as it has been defined. An example illustrates this application.

Example. From a list of 2422 minor household expenditures a simple random
sample of 180 items was drawn in order to estimate the total spent for operation
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of the household. Certain types of expenditure (on clothing and car upkeep) 
were not considered relevant. Of the 180 sample items, 152 were relevant. The 
sum and uncorrected sum of squares of the relevant amounts (in dollars) were 


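as follows:

\[ \sum y_i = 343.5, \qquad \sum y_i^2 = 1491.38 \]

Estimate the total expenditure for household operation and give the standard
error of the estimate.

\[ \hat{Y}_j = \frac{N}{n}\sum y_i = \frac{(2422)(343.5)}{180} = \$4622 \]

From (2.44)

\[ s(\hat{Y}_j) = \frac{Ns'}{\sqrt{n}}\sqrt{1 - \frac{n}{N}} \]

In computing s' we regard our sample of 180 items as having 28 zeros. Hence

\[ s'^2 = \frac{1}{n-1}\left[\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right] = \frac{1}{179}\left[1491.38 - \frac{(343.5)^2}{180}\right] = 4.670 \]

Finally,

\[ s(\hat{Y}_j) = (2422)\sqrt{\frac{4.670}{180}}\sqrt{1 - \frac{180}{2422}} = \$375 \]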
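The arithmetic of this example is short enough to reproduce directly. The following sketch uses only the figures quoted in the example; the variable names are chosen here for illustration. Note that the 28 irrelevant items enter the calculation only through the divisors n and n - 1, since each contributes a zero to both the sum and the sum of squares.

```python
import math

N, n = 2422, 180                      # frame size and sample size
sum_y, sum_y2 = 343.5, 1491.38        # sums over the 152 relevant items

# Estimated domain total: sample total raised by the factor N/n.
total_hat = N / n * sum_y

# s'^2 treats the 28 irrelevant items as zeros, so the divisors are n and n - 1.
s2_prime = (sum_y2 - sum_y ** 2 / n) / (n - 1)

# Estimated standard error of the total, as in (2.44).
se_total = N * math.sqrt(s2_prime / n) * math.sqrt(1.0 - n / N)

print(f"estimated total = {total_hat:.0f}")
print(f"s'^2 = {s2_prime:.3f}")
print(f"standard error = {se_total:.0f}")
```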


applicati 
tribute 


h is known, 


Consequently it is worth examining by how much $V(\hat{Y}_j)$ is reduced when
$N_j$ is known. If $N_j$ is not known, then from (2.42)

\[ V(\hat{Y}_j) = \frac{N^2S'^2}{n}\left(1 - \frac{n}{N}\right) \]

If $\bar{Y}_j$, $S_j$ are the mean and standard deviation in the domain of
interest (i.e., among the nonzero units) the reader may verify that

\[ (N-1)S'^2 = (N_j - 1)S_j^2 + N_j\bar{Y}_j^2\left(1 - \frac{N_j}{N}\right) \]
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Since terms in 1/N; and 1/N are nearly always negligible, 
\[ S'^2 \doteq P_jS_j^2 + P_jQ_j\bar{Y}_j^2 \tag{2.45} \]

where $P_j = N_j/N$ and $Q_j = 1 - P_j$. This gives

\[ V(\hat{Y}_j) \doteq \frac{N^2}{n}\left(P_jS_j^2 + P_jQ_j\bar{Y}_j^2\right)\left(1 - \frac{n}{N}\right) \tag{2.46} \]

If nonzero units are identified, we draw a sample of size $n_j$ from them.
The estimate of the domain total is $N_j\bar{y}_j$ with variance

\[ V(N_j\bar{y}_j) = \frac{N_j^2}{n_j}S_j^2\left(1 - \frac{n_j}{N_j}\right) = \frac{N^2}{n_j}P_j^2S_j^2\left(1 - \frac{n_j}{N_j}\right) \tag{2.47} \]

The comparable variances are (2.46) and (2.47). In (2.46) the average
number of nonzero units in the sample of size n is $nP_j$. If we take $n_j = nP_j$
in (2.47), so that the number of nonzeros to be measured is about the
same with both methods, (2.47) becomes

\[ V(N_j\bar{y}_j) = \frac{N^2}{n}P_jS_j^2\left(1 - \frac{n}{N}\right) \tag{2.48} \]

The ratio of the variances (2.48) to (2.46) is

\[ \frac{V(N_j \text{ known})}{V(N_j \text{ not known})} = \frac{S_j^2}{S_j^2 + Q_j\bar{Y}_j^2} = \frac{C_j^2}{C_j^2 + Q_j} \]

where $C_j = S_j/\bar{Y}_j$ is the coefficient of variation among the nonzeros.
As might be expected, the reduction in variance due to a knowledge of N; is 
greater when the proportion of zero units is large and when y; varies 
relatively little among the nonzero units. For further study of this prob- 
lem, see Jessen and Houseman (1944). 
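To see how the gain from knowing the domain size depends on the coefficient of variation and on the proportion of zero units, the two variances can be evaluated side by side. The sketch below is only illustrative: the function names and all numerical inputs are hypothetical, and the formulas coded are the approximate variance when the domain size is unknown (2.46) and the variance when it is known with about the same number of nonzero units measured (2.48).

```python
import math

def var_total_nj_unknown(N, n, P_j, S_j, Ybar_j):
    """Approximate variance (2.46) of the estimated domain total, N_j unknown."""
    Q_j = 1.0 - P_j
    return (N ** 2 / n) * (P_j * S_j ** 2 + P_j * Q_j * Ybar_j ** 2) * (1.0 - n / N)

def var_total_nj_known(N, n, P_j, S_j):
    """Variance (2.48): N_j known, with n_j = n * P_j nonzero units measured."""
    return (N ** 2 / n) * P_j * S_j ** 2 * (1.0 - n / N)

def variance_ratio(C_j, Q_j):
    """Ratio V(N_j known) / V(N_j not known) = C_j^2 / (C_j^2 + Q_j)."""
    return C_j ** 2 / (C_j ** 2 + Q_j)

# Hypothetical inputs (assumptions for illustration only).
N, n = 2000, 100
P_j, S_j, Ybar_j = 0.25, 3.0, 6.0

direct = var_total_nj_known(N, n, P_j, S_j) / var_total_nj_unknown(N, n, P_j, S_j, Ybar_j)
shortcut = variance_ratio(S_j / Ybar_j, 1.0 - P_j)
print(direct, shortcut)
```

With these inputs three quarters of the units are zeros and the coefficient of variation among the nonzeros is 0.5, so knowing the domain size cuts the variance to one quarter.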


2.12 COMPARISONS BETWEEN DOMAIN MEANS 


Let $\bar{y}_j$, $\bar{y}_k$ be the sample means in the jth and kth of a set of domains into
which the units in a simple random sample are classified. The variance of
their difference is

\[ V(\bar{y}_j - \bar{y}_k) = V(\bar{y}_j) + V(\bar{y}_k) \]

This formula applies also to the difference between two ratios $\hat{R}_j$ and $\hat{R}_k$.

One point should be noted. It is seldom of scientific interest to ask
whether $\bar{Y}_j = \bar{Y}_k$ because these means would not be exactly equal in a
finite population, except by a rare chance, even if the data in both domains 
were drawn at random from the same infinite population. Instead, we 
test the null hypothesis that the two domains were drawn from infinite 
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populations having the same mean. Consequently we omit the fpc when 
computing $V(\bar{y}_j)$ and $V(\bar{y}_k)$, using the formula

\[ V(\bar{y}_j - \bar{y}_k) = \frac{s_j^2}{n_j} + \frac{s_k^2}{n_k} \]
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Confidence that the normal approximation is adequate in most practical 
situations comes from a variety of sources. In the theory of probability 
much study has been made of the distribution of means of random sam- 
ples. It has been proved that for any population which has a finite standard 
deviation the distribution of the sample mean tends to normality as 
n increases (see, e.g., Feller, 1957). This work relates to infinite populations.

For sampling without replacement from finite populations, Hájek (1960)
has given necessary and sufficient conditions under which the distribution
of the sample mean tends to normality, following work by Erdős and
Rényi (1959) and Madow (1948). Hájek assumes a sequence of values
$n_\nu$, $N_\nu$ tending to infinity in such a way that $(N_\nu - n_\nu)$ also tends to
infinity. The measurements in the $\nu$th population are denoted by $y_{\nu i}$
$(i = 1, 2, \ldots, N_\nu)$. For this population, let $S_{\nu\tau}$ be the set of units in the
population for which

\[ |y_{\nu i} - \bar{Y}_\nu| > \tau\sqrt{n_\nu(1 - f_\nu)}\,S_\nu \]

where $\bar{Y}_\nu$, $S_\nu$, $f_\nu$ are the population mean, s.d. and fpc, and $\tau$ is a number > 0.
Then the Lindeberg-type condition

\[ \lim_{\nu\to\infty} \frac{\sum_{S_{\nu\tau}}(y_{\nu i} - \bar{Y}_\nu)^2}{(N_\nu - 1)S_\nu^2} = 0 \]

is necessary and sufficient for the distribution of the sample mean to tend
to normality.


on ion is accurate enough?" Non- 
? Vary great i : 
fee Lo Y greatly both in the nature and in the degree of 


1 y. In sampling practic it 
y distributions will all b em i Hie 
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Frequency 


Fig. 2.1. Frequency distribution of sizes of 196 United States cities in 1920.
(Horizontal axis: city size in thousands; vertical axis: frequency.)


frequency distribution of the numbers of inhabitants in 196 large United 
States cities in 1920. (The four largest cities, New York, Chicago, Phila- 
delphia, and Detroit, were omitted. Their inclusion would extend the 
horizontal scale to more than five times the length shown and would, of 
course, greatly accentuate the skewness.) Figure 2.2 shows the frequency 


Fig. 2.2. Frequency distribution of totals of 200 simple random samples with n = 49.
(Horizontal axis: sample total in millions; vertical axis: frequency.)
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distribution of the total number of inhabitants in each of 200 simple ran- 
dom samples, with n = 49, drawn from this population. The distribution 
of the sample totals, and likewise of the means, is much more similar to a 
normal curve but still displays some positive skewness. 

In any discussion of the validity of the normal approximation we must 
define what it means to say that the normal approximation is “accurate
enough.” In sample surveys the normal approximation is used primarily
to calculate confidence limits. When 95% confidence limits are computed
for the population mean $\bar{Y}$ by the normal approximation, we make the
following statement:

\[ \bar{y} - 1.96s_{\bar{y}} < \bar{Y} < \bar{y} + 1.96s_{\bar{y}} \tag{2.49} \]


With repeated sampling, we claim that statements of this kind will be 
wrong only 5% of the time. Consequently we might say that the normal 
approximation is accurate enough if such statements are in fact wrong 
between 4 and 6% of the time. The choice of the numbers 4 and 6 is 
arbitrary: some workers may be satisfied with wider limits. 

From the study of theoretical distributions that are skewed and from the 
results of sampling experiments on actual skewed populations, some 
statements can be made about what usually happens to confidence pro- 
babilities when we sample from positively skew populations. The sample 
size is assumed large enough so that the distribution of $\bar{y}$ shows some
approach to normality, as in Fig. 2.2. The statements are as follows:

1. The frequency with which the assertion

\[ \bar{y} - 1.96s_{\bar{y}} < \bar{Y} < \bar{y} + 1.96s_{\bar{y}} \]

is wrong is usually higher than 5%.

2. The frequency with which

\[ \bar{Y} > \bar{y} + 1.96s_{\bar{y}} \]

is greater than 2.5%.

3. The frequency with which

\[ \bar{Y} < \bar{y} - 1.96s_{\bar{y}} \]

is less than 2.5%.


y which is essentially binomially 
n of y can be read from the bino- 




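$\bar{Y} = Ph$. A simple random sample of size n shows a units which have the
value h and n - a units which have the value 0. For the sample,

\[ \sum y_i = ah, \qquad \bar{y} = \frac{ah}{n} \]

\[ (n-1)s^2 = \sum y_i^2 - n\bar{y}^2 = ah^2 - \frac{a^2h^2}{n} \]

\[ s^2 = \frac{h^2}{n}\,\frac{a(n-a)}{n-1} \]

Hence 95% confidence limits for $\bar{Y}$ are

\[ \bar{y} \pm 1.96s_{\bar{y}} = \frac{h}{n}\left(a \pm 1.96\sqrt{\frac{a(n-a)}{n-1}}\right) \tag{2.50} \]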


Let n = 400, P = 0.1. Then $\bar{Y} = 0.1h$. By trial we find that if a = 29
in expression (2.50) the upper confidence limit is 39.18h/400 = 0.098h,
whereas a = 30 gives 40.34h/400 = 0.101h. Hence any value of $a \le 29$
gives an upper confidence limit that is too low. Similarly we find that if
$a \ge 54$ the lower limit is too high.

The variate a follows the binomial distribution with n = 400, P = 0.1. 
The tables (Harvard Computation Laboratory, 1955) show that 


Pr (stated upper limit too low) = Pr ($a \le 29$) = 0.0357
Pr (stated lower limit too high) = Pr ($a \ge 54$) = 0.0217


Pr (confidence statement wrong) = 0.0574 


The total probability of being wrong is not far from 0.05. In more than 
60% of the wrong statements, the true mean is higher than the stated upper 
limit. 
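The search "by trial" for the values of a at which the stated limits fail to cover the true mean can be automated. The sketch below works in units of h/n, so that the true mean is nP = 40 and the limits from (2.50) are a plus or minus 1.96 times the square root of a(n - a)/(n - 1); the function name is introduced here for illustration only.

```python
import math

n = 400
true_a = 40  # nP with P = 0.1: the true mean expressed in units of h/n

def limits_in_units_of_h_over_n(a, n):
    """Lower and upper 95% limits from (2.50), divided by h/n."""
    half_width = 1.96 * math.sqrt(a * (n - a) / (n - 1))
    return a - half_width, a + half_width

# Values of a for which the interval misses the true mean on either side.
upper_too_low = [a for a in range(n + 1)
                 if limits_in_units_of_h_over_n(a, n)[1] < true_a]
lower_too_high = [a for a in range(n + 1)
                  if limits_in_units_of_h_over_n(a, n)[0] > true_a]

largest_a_upper_too_low = max(upper_too_low)
smallest_a_lower_too_high = min(lower_too_high)
print(largest_a_upper_too_low, smallest_a_lower_too_high)
```

The same loop could be combined with exact binomial probabilities to obtain the error frequencies; that step is left out here so that the sketch stays with the cut-off values that the text derives.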

There is no safe general rule as to how large n must be for use of the 
normal approximation in computing confidence limits. For populations in 
which the principal deviation from normality consists of marked positive 
skewness, a crude rule which I have occasionally found useful is

\[ n > 25G_1^2 \]

where $G_1$ is Fisher's measure of skewness (Fisher, 1932).

\[ G_1 = \frac{E(y_i - \bar{Y})^3}{\sigma^3} = \frac{1}{N\sigma^3}\sum_{i=1}^{N}(y_i - \bar{Y})^3 \]

This rule is designed so that a 95% confidence probability statement will
be wrong not more than 6% of the time. It is derived mathematically by
assuming that any disturbance due to moments of the distribution of $\bar{y}$
higher than the third is negligible. The rule attempts to control only the
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total frequency of wrong statements, ignoring the direction of the error 
of estimate. 

By calculating G,, or an estimate, for a specific population, we can ob- 
tain a rough idea of the sample size needed for application of the normal 
approximation to compute confidence limits. The result should be checked 
by sampling experiments whenever possible. 


Example. The data in Table 2.3 show the numbers of acres devoted to crops 


TABLE 2.3
FREQUENCY DISTRIBUTION OF ACRES IN CROPS ON 556 FARMS

Class Intervals   Coded Scale   Frequency
(acres)           y_i           f_i          f_i y_i    f_i y_i^2    f_i y_i^3

  0-29            -0.9           47           -42.3        38.1        -34.3
 30-63             0            143             0           0            0
 64-97             1            154           154         154          154
 98-131            2             82           164         328          656
132-165            3             62           186         558        1,674
166-199            4             33           132         528        2,112
200-233            5             13            65         325        1,625
234-267            6              6            36         216        1,296
268-301            7              4            28         196        1,372
302-335            8              6            48         384        3,072
336-369            9              2            18         162        1,458
370-403           10              0             0           0            0
404-437           11              2            22         242        2,662
438-471           12              0             0           0            0
472-505           13              2            26         338        4,394

Totals                          556           836.7     3,469.1     20,440.7

\[ E(y_i) = \bar{Y} = \frac{836.7}{556} = 1.50486 \]

\[ E(y_i^2) = \frac{3469.1}{556} = 6.23939 \]

\[ E(y_i^3) = \frac{20{,}440.7}{556} = 36.76385 \]

\[ \sigma^2 = E(y_i^2) - \bar{Y}^2 = 3.97479 \]

\[ \kappa_3 = E(y_i - \bar{Y})^3 = E(y_i^3) - 3E(y_i^2)\bar{Y} + 2\bar{Y}^3 = 15.411 \]

\[ G_1 = \frac{\kappa_3}{\sigma^3} = \frac{15.411}{(3.97479)^{3/2}} = 1.9 \]
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on 556 farms in Seneca County, New York. The data come from a series of 
studies by West (1951), who drew repeated samples of size 100 from this population
and examined the frequency distributions of $\bar{y}$, s, and Student's t for several
items of interest in farm management surveys.

The computation of $G_1$ is shown under the table. The computations are made
on a coded scale, and, since $G_1$ is a pure number, there is no need to return to the
original scale. Note that the first class-interval was slightly different from the
others.

Since $G_1 = 1.9$, we take as a suggested minimum

\[ n = (25)(1.9)^2 = 90 \]

For samples of size 100, West found with this item (acres in crops) that neither
the distribution of $\bar{y}$ nor that of Student's t differed significantly from the
corresponding theoretical normal distributions.
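The coded-scale computation of the skewness measure and of the suggested minimum sample size can be sketched as follows. The coded values and frequencies are those of Table 2.3, with the frequencies for the coded values 11 and 12 taken as 2 and 0, the only values consistent with the column totals. Because the code carries full precision, its moments differ slightly in the later decimals from the figures printed under the table, and the rule gives about 95 if the unrounded skewness is used instead of 1.9.

```python
# Coded class values and frequencies from Table 2.3 (556 farms).
coded = [-0.9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
freq = [47, 143, 154, 82, 62, 33, 13, 6, 4, 6, 2, 0, 2, 0, 2]

N = sum(freq)

def raw_moment(k):
    """k-th moment about zero of the coded variable."""
    return sum(f * y ** k for f, y in zip(freq, coded)) / N

m1, m2, m3 = raw_moment(1), raw_moment(2), raw_moment(3)

variance = m2 - m1 ** 2
kappa3 = m3 - 3.0 * m2 * m1 + 2.0 * m1 ** 3   # third central moment
G1 = kappa3 / variance ** 1.5                 # Fisher's measure of skewness

n_min_rounded = 25 * round(G1, 1) ** 2        # uses G1 rounded to 1.9, as in the text
n_min_unrounded = 25 * G1 ** 2

print(f"mean = {m1:.5f}, variance = {variance:.5f}, kappa3 = {kappa3:.3f}")
print(f"G1 = {G1:.2f}, suggested minimum n = {n_min_rounded:.0f} "
      f"(about {n_min_unrounded:.0f} without rounding G1)")
```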


Good sampling practice tends to make the normal approximation more 
valid. Failure of the normal approximation occurs mostly when the popu- 
lation contains some extreme individuals which dominate the sample 
average when they are present. However, these extremes also have a much 
more serious effect of increasing the variance of the sample and decreasing 
the precision. Consequently, it is wise to segregate them and make sepa- 
rate plans for coping with them, perhaps by taking a complete enumera- 
tion of them if they are not numerous. This removal of the extremes from 
the main body of the population reduces the skewness and improves the 
normal approximation. This technique is an example of stratified sam- 
pling, which is discussed in Chapter 5. 


2.14 EFFECT OF NON-NORMALITY ON THE ESTIMATED
VARIANCE


One effect of non-normality is that the estimated variance $s^2$ may be
more highly variable from sample to sample than we expect if we assume
that we are sampling from a normal distribution. For any infinite population,
the variance of $s^2$ in random samples of size n is (Fisher, 1932)

\[ V(s^2) = \frac{2\sigma^4}{n-1} + \frac{\kappa_4}{n} \tag{2.51} \]

The first term after the equality sign is the value that the variance of $s^2$ has
when the parent distribution is normal. The second term represents the
effect of non-normality. The quantity $\kappa_4$ is Fisher's fourth cumulant
(Fisher, 1932) and is given by

\[ \kappa_4 = E(y_i - \bar{Y})^4 - 3\sigma^4 \]

Note that skewness in the original distribution, as measured by $G_1$, does
not affect the stability of $s^2$: the important factor is the fourth moment in
the parent population.
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The cumulant «x, is zero for a normal distribution. It may take either 
positive or negative values in other distributions, but in those encountered
in sampling practice $\kappa_4$ appears to be positive much more often than negative
and may have a high value for some parent distributions.

We may write (2.51) as

\[ V(s^2) = \frac{2\sigma^4}{n-1}\left(1 + \frac{n-1}{2n}\,\frac{\kappa_4}{\sigma^4}\right) = \frac{2\sigma^4}{n-1}\left(1 + \frac{n-1}{2n}G_2\right) \]

where $G_2 = \kappa_4/\sigma^4$ is Fisher's measure of kurtosis (loc. cit.). The quantity
inside the parentheses shows the factor by which the variance of $s^2$ is
inflated owing to non-normality. Note that the factor is almost independent
of n, so that the inflation remains even with large samples.

For West's data on farm acres in crops (Table 2.3), the value of $G_2$ will
be found to be about 6. Thus $V(s^2)$ is close to four times as large as would
be assumed if we regarded the original distribution of acres in crops as
normal. In his sampling studies West found a similar inflation in the
variance of the standard deviation s in three items he tested. The ratio of
V(s) to the theoretical variance of s from a normal population was 3.7 for
acres in crops, 2.1 for total acres operated, and 13.7 for productive-man-work
units. (By theory this ratio should be roughly the same for s as
for $s^2$.)
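The kurtosis measure quoted for the acres-in-crops data, and the resulting inflation factor, can be checked from the coded frequency distribution in Table 2.3. In this sketch the frequencies for the coded values 11 and 12 are taken as 2 and 0 (the values consistent with the table totals), and the sample size n = 100 is an assumption chosen to match the size of West's samples.

```python
# Coded class values and frequencies from Table 2.3 (556 farms).
coded = [-0.9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
freq = [47, 143, 154, 82, 62, 33, 13, 6, 4, 6, 2, 0, 2, 0, 2]

N = sum(freq)
mean = sum(f * y for f, y in zip(freq, coded)) / N

def central_moment(k):
    """k-th moment about the mean of the coded variable."""
    return sum(f * (y - mean) ** k for f, y in zip(freq, coded)) / N

variance = central_moment(2)
kappa4 = central_moment(4) - 3.0 * variance ** 2   # fourth cumulant
G2 = kappa4 / variance ** 2                        # Fisher's measure of kurtosis

n = 100  # assumed sample size, as in West's sampling studies
inflation = 1.0 + (n - 1) / (2.0 * n) * G2         # factor multiplying 2*sigma^4/(n-1)

print(f"G2 = {G2:.2f}, inflation factor for n = {n}: {inflation:.2f}")
```

Changing n shows how little the factor depends on the sample size, which is the point made in the text.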

The relevance of these results in practical sampling is that we sometimes
use values of $s^2$ to compare the precision of one method of sampling with
that of another or to estimate the sample size needed to attain a specified
degree of precision in $\bar{y}$ (see Chapter 4). For these purposes it is well to
have some idea of the precision of the estimate $s^2$, particularly if it has been
calculated from rather scanty data. As the previous results indicate, use of
the “normal” formula for appraising the variance of $s^2$ may give a very
misleading impression of the stability of $s^2$.


EXERCISES 
2.1 In a population with N = 6 the values of $y_i$ are 8, 3, 1, 11, 4, and 7.
Calculate the sample mean $\bar{y}$ for all possible simple random samples of size 2.
Verify that $\bar{y}$ is an unbiased estimate of $\bar{Y}$ and that its variance is as given in
theorem 2.2.

2.2 For the same population, calculate $s^2$ for all simple random samples of
size … Verify that $E(s^2) = S^2$.
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2.4 A simple random sample of 30 households was drawn from a city area 
containing 14,848 households. The numbers of persons per household in the 
sample were as follows: 


5, 6, 3, 3, 2, 3, 3, 3, 4, 4, 3, 2, 7, 4, 3, 5, 4, 4, 3, 3, 4, 3, 3, 1, 2, 4, 3, 4, 2, 4 


Estimate the total number of people in the area and compute the probability 
that this estimate is within ±10% of the true value.

2.5 In a study of the possible use of sampling to cut down the work in taking
inventory in a stock room, a count is made of the value of the articles on each 
of 36 shelves in the room. The values to the nearest dollar are as follows. 


29, 38, 42, 44, 45, 47, 51, 53, 53, 54, 56, 56, 56, 58, 58, 59, 60, 60, 
60, 60, 61, 61, 61, 62, 64, 65, 65, 67, 67, 68, 69, 71, 74, 77, 82, 85. 


The estimate of total value made from a sample is to be correct within $200, 
apart from a 1 in 20 chance. An advisor suggests that a simple random sample 
of 12 shelves will meet the requirements. Do you agree? 


\[ \sum y = 2138, \qquad \sum y^2 = 131{,}682 \]


2.6 After the sample in Table 2.1 (p. 27) was taken, the number of completely 
filled sheets (with 42 signatures each) was counted and found to be 326. Use 
this information to make an improved estimate of the total number of signatures 
and find the standard error of your estimate.

2.7 From a list of 468 small two-year colleges a simple random sample of 
100 colleges was drawn. The sample contained 54 public and 46 private 
colleges. Data for number of students (y) and number of teachers (x) are shown 


below.

              n      Σ(y)       Σ(x)
Public       54     31,281     2,024
Private      46     13,707     1,075

              Σ(y^2)        Σ(yx)        Σ(x^2)
Public       29,881,219    1,729,349    111,090
Private       6,366,785      431,041     33,119


(a) For each type of college in the population, estimate the ratio (number of
students)/(number of teachers). (b) Compute the standard errors of your
estimates. (c) For the public colleges, find 90% confidence limits for the 
student/teacher ratio in the whole population. 

2.8 In the preceding example test at the 5% level whether the student/teacher 
ratio is significantly different in the two types of colleges. 

2.9 For the public colleges, estimate the total number of teachers (a) given 
that the total number of public colleges in the population is 251, (b) without 
knowing this figure. In each case compute the standard error of your estimate. 

2.10 The table below shows the numbers of inhabitants in each of the 197 
United States cities which had populations over 50,000 in 1940. Calculate the 
standard error of the estimated total number of inhabitants in all 197 cities for 
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the following methods of sampling: (a) a simple random sample of size 50, (5) a 
sample which includes the five largest cities and is a simple random sample of 
size 45 from the remaining 192 cities, (c) a sample which includes the nine 
largest cities and is a simple random sample of size 41 from the remaining cities. 


FREQUENCY DISTRIBUTION OF CITY SIZES


Size Class           Size Class           Size Class
(1000's)      f      (1000's)      f      (1000's)      f

 50-100      105     550-600       2      ....
100-150       36     600-650       1      1500-1550     1
150-200       13     650-700       2      ....
200-250        6     700-750       0      1600-1650     1
250-300        7     750-800       1      ....
300-350        8     800-850       1      1900-1950     1
350-400        4     850-900       2      ....
400-450        1     900-950       0      3350-3400     1
450-500        3     950-1000      0      ....
500-550        0     1000-1050     0      7450-7500     1


Gaps in the intervals are indicated by ....


2.11 Calculate the coefficient of skewness $G_1$ for the original population
and for the population remaining after removing (a) the five largest cities, (b) the 
nine largest cities. 

2.12 A small survey is to be taken to compare home-owners with renters. 
In the population about 75% are owners, 25% are renters. For one item the 
variance is thought to be about 15 for both owners and renters. The standard 
error of the difference between the two domain means is not to exceed 1. How 
large a sample is needed (a) if owners and renters can be identified in advance of 
drawing the sample, (b) if not. (An approximate answer will do in (b); an exact
discussion requires binomial tables.)

2.13 A simple random sample of size 3 is drawn from a population of size
N with replacement. Show that the probabilities that the sample contains 1,
2, and 3 different units (for example, aaa, aab, abc, respectively) are

\[ P_1 = \frac{1}{N^2}, \qquad P_2 = \frac{3(N-1)}{N^2}, \qquad P_3 = \frac{(N-1)(N-2)}{N^2} \]

As an estimate of $\bar{Y}$ we take $\bar{y}'$, the unweighted mean over the different units
in the sample. Show that the average variance of $\bar{y}'$ is

\[ V(\bar{y}') = \frac{(2N-1)(N-1)S^2}{6N^2} \]

One way to do this is to show that

\[ V(\bar{y}') = \frac{S^2}{N}\left[(N-1)P_1 + \frac{(N-2)}{2}P_2 + \frac{(N-3)}{3}P_3\right] \]
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Hence show that V(g’) < V(9), where g is the ordinary mean of the n observa- 
tions in the sample. The result that V(y’) < V(g) for any n > Z was proved by 
Raj and Khamis (1958). 

2.14 Two dentists A and B make a survey of the state of the teeth of 200 
children in a village. Dr. A selects a simple random sample of 20 children and 
counts the number of decayed teeth for each child, with the following results: 


Number of decayed teeth/child   0   1   2   3   4   5   6   7   8   9   10
Number of children              8   4   2   2   1   1   0   0   0   1    1


Dr. B, using the same dental techniques, examines all 200 children, recording 
merely those who have no decayed teeth. He finds 60 children with no decayed 
teeth. 

Estimate the total number of decayed teeth in the village children, (a) using A's
results only; (b) using both A's and B's results. (c) Are the estimates unbiased?
(d) Which estimate do you expect to be more precise?

2.15 A company intends to interview a simple random sample of employees
who have been with it more than five years. The company has $1000 to spend,
and each interview costs $10. There is no separate list of employees with more
than five years service, but a list can be compiled from the files at a cost of $200.
The company can either (a) compile the list and interview a simple random
sample drawn from the eligible employees or (b) draw a simple random sample
of all employees, interviewing only those eligible. The cost of rejecting those
not eligible in the sample is assumed negligible.

Show that for estimating a total over the population of eligible employees, plan
(a) gives a smaller variance than plan (b) only if $V_j < 2\sqrt{Q_j}$, where $V_j$ is the
coefficient of variation of the item among eligible employees and $Q_j$ is the
proportion of noneligibles in the company. Ignore the fpc.
CHAPTER 3


Sampling for Proportions and Percentages


3.1 QUALITATIVE CHARACTERISTICS


Sometimes we wish to estimate the total number, the proportion, or the 
percentage of units in the population which possess some characteristic or 
attribute or fall into some defined class. Many of the results regularly 
published from censuses or surveys are of this form, for example, numbers 
of unemployed persons, the percentage of the population that is native- 
born. The classification may be introduced directly into the questionnaire, 
as in questions that are answered by a simple “yes” or “no.” In other cases 
the original measurements are more or less continuous, and the classifi- 
cation is introduced in the tabulation of results. Thus we may record the 
respondents’ ages to the nearest year but publish the percentage of the 
population aged 60 and over. 

Notation. We suppose that every unit in the population falls into one of 
the two classes C and C'. The notation is as follows: 


Number of units in C in        Proportion of units in C in
Population     Sample          Population     Sample
    A             a             P = A/N       p = a/n


The sample estimate of P is p, and the sample estimate of A is Np or Na/n.
In statistical work the binomial distribution is often applied to estimates
like a and p. As will be seen, the correct distribution for finite populations
is the hypergeometric, although the binomial is usually a satisfactory
approximation.


3.2 VARIANCES OF THE SAMPLE ESTIMATES 


By means of a simple device it is possible to apply the theorems 
established in Chapter 2 to this situation. For any unit in the sample or
population, define $y_i$ as 1 if the unit is in C and as 0 if it is in C'. For this
population of values $y_i$, it is clear that

$$Y = \sum_{i=1}^{N} y_i = A \tag{3.1}$$

$$\bar{Y} = \frac{\sum_{i=1}^{N} y_i}{N} = \frac{A}{N} = P \tag{3.2}$$

Also, for the sample,

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{a}{n} = p \tag{3.3}$$


Consequently the problem of estimating A and P can be regarded as that
of estimating the total and mean of a population in which every $y_i$ is either
1 or 0. In order to use the theorems in Chapter 2, we first express $S^2$ and $s^2$
in terms of P and p. Note that

$$\sum_{i=1}^{N} y_i^2 = A = NP, \qquad \sum_{i=1}^{n} y_i^2 = a = np$$

Hence

$$S^2 = \frac{\sum_{i=1}^{N}(y_i - \bar{Y})^2}{N-1} = \frac{\sum_{i=1}^{N} y_i^2 - N\bar{Y}^2}{N-1} = \frac{1}{N-1}(NP - NP^2) = \frac{N}{N-1}PQ \tag{3.4}$$

where Q = 1 - P. Similarly

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1} = \frac{n}{n-1}pq \tag{3.5}$$

Application of theorems 2.1, 2.2, and 2.4 to this population gives the
following results for simple random sampling of the units that are being
classified.

Theorem 3.1. The sample proportion p = a/n is an unbiased estimate
of the population proportion P = A/N.

Theorem 3.2. The variance of p is

$$V(p) = E(p - P)^2 = \frac{S^2}{n}\left(\frac{N-n}{N}\right) = \frac{PQ}{n}\left(\frac{N-n}{N-1}\right) \tag{3.6}$$

using (3.4).

Corollary 1. If p and P are the sample and population percentages,
respectively, falling into class C, (3.6) continues to hold for the variance
of p.


Corollary 2. The variance of $\hat{A} = Np$, the estimated total number of
units in class C, is

$$V(\hat{A}) = \frac{N^2 PQ}{n}\left(\frac{N-n}{N-1}\right) \tag{3.7}$$


Theorem 3.3. An unbiased estimate of the variance of p, derived from
the sample, is

$$v(p) = s_p^2 = \frac{N-n}{(n-1)N}\,pq \tag{3.8}$$

Proof. In the corollary of theorem 2.4 it was shown that for a continuous
variate $y_i$ an unbiased estimate of the variance of the sample mean $\bar{y}$ is

$$v(\bar{y}) = \frac{s^2}{n}\,\frac{(N-n)}{N} \tag{3.9}$$

For proportions, p takes the place of $\bar{y}$, and in (3.5) we showed that

$$s^2 = \frac{n}{n-1}\,pq \tag{3.10}$$

Hence

$$v(p) = s_p^2 = \frac{N-n}{(n-1)N}\,pq$$


It follows that if N is very large relative to n, so that the fpc is negligible,
an unbiased estimate of the variance of p is

$$\frac{pq}{n-1}$$

This result may appear puzzling to some readers, since the expression
pq/n is almost invariably used in practice for the estimated variance. The
fact is that pq/n is not unbiased even with an infinite population.
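
The unbiasedness claimed in theorem 3.3 can be checked by brute force on a very small population. The sketch below is only an illustration: the population of five units (two of them in class C), the sample size of 2, and the function name `check_unbiased` are assumptions chosen for convenience, not taken from the text. It enumerates every simple random sample and averages the estimate (3.8).

```python
from fractions import Fraction
from itertools import combinations

def check_unbiased(population, n):
    """Compare the true variance (3.6) with the average of estimate (3.8)
    over all simple random samples of size n (n >= 2).

    `population` is a list of 0/1 values (1 = unit is in class C).
    """
    N = len(population)
    P = Fraction(sum(population), N)
    true_var = P * (1 - P) / n * Fraction(N - n, N - 1)      # formula (3.6)
    estimates = []
    for sample in combinations(population, n):               # positions are distinct units
        p = Fraction(sum(sample), n)
        estimates.append(Fraction(N - n, (n - 1) * N) * p * (1 - p))  # formula (3.8)
    mean_estimate = sum(estimates) / len(estimates)
    return true_var, mean_estimate

# Hypothetical population: N = 5 units, A = 2 in class C.
true_var, mean_estimate = check_unbiased([1, 1, 0, 0, 0], n=2)
print(true_var, mean_estimate)
```

Exact fractions are used so that the two quantities can be compared without rounding error.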


Corollary. An unbiased estimate of the variance of $\hat{A} = Np$, the
estimated total number of units in class C in the population, is

$$v(\hat{A}) = s_{Np}^2 = \frac{N(N-n)}{n-1}\,pq \tag{3.11}$$


Example. From a list of 3042 names and addresses, a simple random sample
of 200 names showed on investigation 38 wrong addresses. Estimate the total
number of addresses needing correction in the list and find the standard error of
this estimate. We have

N = 3042,  n = 200,  a = 38,  p = 0.19

The estimated total number of wrong addresses is

$$\hat{A} = Np = (3042)(0.19) = 578$$

$$s_{\hat{A}} = \sqrt{(3042)(2842)(0.19)(0.81)/199} = \sqrt{6685} = 81.8$$

Since the sampling ratio is under 7%, the fpc makes little difference. To remove
it, replace the term N - n by N. If, in addition, we replace n - 1 by n, we have
the simpler formula

$$s_{Np} = N\sqrt{pq/n} = (3042)\sqrt{(0.19)(0.81)/200} = 84.4$$

This is in fairly close agreement with the previous result, 81.8.
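
The arithmetic of this example is easy to reproduce. The following is a minimal sketch in Python; the function name `estimate_total` is an assumption made for illustration. It codes the estimate Np, the standard error from (3.11), and the simplified form with the fpc dropped and n - 1 replaced by n.

```python
from math import sqrt

def estimate_total(N, n, a):
    """Estimated number of units in class C and two standard errors.

    Illustrative helper (the name is hypothetical): returns N*p, the
    standard error from (3.11), and the simplified N*sqrt(pq/n).
    """
    p = a / n
    q = 1.0 - p
    A_hat = N * p
    se_fpc = sqrt(N * (N - n) * p * q / (n - 1))   # square root of (3.11)
    se_simple = N * sqrt(p * q / n)                # fpc ignored, n - 1 -> n
    return A_hat, se_fpc, se_simple

A_hat, se_fpc, se_simple = estimate_total(N=3042, n=200, a=38)
print(round(A_hat), round(se_fpc, 1), round(se_simple, 1))
```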


The preceding formulas for the variance and the estimated variance of p 
hold only if the units are classified into C or C’ so that p is the ratio of the 
number of units in C in the sample to the total number of units in the 
sample. In many surveys each unit is composed of a group of elements, and 
it is the elements that are classified. A few examples are as follows: 


Sampling Unit        Elements
Family               Members of the family
Restaurant           Employees
Crate of eggs        Individual eggs
Peach tree           Individual peaches


If a simple random sample of units is drawn in order to estimate the
proportion P of elements in the population that belong to class C, the
preceding formulas do not apply. Appropriate methods are given in section
3.12.

3.3 THE EFFECT OF P ON THE STANDARD ERRORS


Equation (3.6) shows how the variance of the estimated percentage
changes with P, for fixed n and N. If the fpc is ignored, we have

$$V(p) = \frac{PQ}{n}$$

The function PQ and its square root are shown in Table 3.1. These
functions may be regarded as the variance and standard deviation,
respectively, for a sample of size 1.

TABLE 3.1
VALUES OF PQ AND √PQ
P = Population percentage in class C

P      0    10    20    30    40    50    60    70    80    90   100
PQ     0   900  1600  2100  2400  2500  2400  2100  1600   900     0
√PQ    0    30    40    46    49    50    49    46    40    30     0


The functions have their greatest values when the population is equally 
divided between the two classes, and are symmetrical about this point. The 
standard error of p changes relatively little when P lies anywhere between 
30 and 70%. At the maximum value of √PQ, 50, a sample size of 100 is
needed to reduce the standard error of the estimate to 5%. To attain a 1% 
standard error requires a sample size of 2500. 
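
The two sample sizes quoted above follow from solving V(p) = PQ/n for n. A minimal sketch, assuming the fpc is ignored and percentages are used throughout (the function name `sample_size_for_se` is hypothetical):

```python
def sample_size_for_se(P_percent, se_percent):
    """Sample size n at which sqrt(PQ/n) equals the target standard error.

    Both arguments are percentages; the fpc is ignored.
    """
    Q_percent = 100.0 - P_percent
    return P_percent * Q_percent / se_percent ** 2

print(sample_size_for_se(50, 5))   # standard error of 5% at P = 50%
print(sample_size_for_se(50, 1))   # standard error of 1% at P = 50%
```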

This approach is not appropriate when interest lies in the total number of 
units in the population which are in class C. In this event it is more natural 
to ask: Is the estimate likely to be correct to within, say, 7% of the true 
total? Thus we tend to think of the standard error expressed as a fraction 
or percentage of the true value, NP. The fraction is 


$$\frac{\sigma_{Np}}{NP} = \frac{N\sqrt{PQ}}{\sqrt{n}\,NP}\sqrt{\frac{N-n}{N-1}} = \frac{1}{\sqrt{n}}\sqrt{\frac{Q}{P}}\sqrt{\frac{N-n}{N-1}} \tag{3.12}$$


This quantity is usually called the coefficient of variation of the estimate.
If the fpc is ignored, the coefficient is $\sqrt{Q/nP}$. The ratio $\sqrt{Q/P}$, which
might be considered the coefficient of variation for a sample of size 1, is
shown in Table 3.2.


TABLE 3.2
VALUES OF √(Q/P) FOR DIFFERENT VALUES OF P

P = Population percentage in class C

P          0    0.1    0.5     1     5    10    20
√(Q/P)     ∞   31.6   14.1   9.9   4.4   3.0   2.0

P         30    40     50    60    70    80    90
√(Q/P)   1.5   1.2    1.0   0.8   0.7   0.5   0.3

For a fixed sample size, the coefficient of variation of the estimated total 
in class C decreases steadily as the true percentage in C increases. The 
From (3.14) the distribution of the number of males, a, is as follows:

a    Probability

2    $\frac{4!}{2!\,2!}\cdot\frac{3\cdot 2\cdot 5\cdot 4}{8\cdot 7\cdot 6\cdot 5} = \frac{6}{14}$

3    $\frac{4!}{3!\,1!}\cdot\frac{3\cdot 2\cdot 1\cdot 5}{8\cdot 7\cdot 6\cdot 5} = \frac{1}{14}$

4    Impossible = 0


The reader may verify that the mean number of males is 3/2 and the variance is 15/28.


These results agree with the formulas previously established in section 3.2,
which give

$$E(np) = \frac{nA}{N} = \frac{3}{2}$$

$$V(np) = nPQ\,\frac{N-n}{N-1} = \frac{15}{28}$$
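
The verification suggested to the reader can be sketched in a few lines. The values N = 8, A = 3 and n = 4 used below are an assumption: they are inferred from the factors that appear in the probabilities above (denominator 8·7·6·5, numerator factors 3·2 and 5·4), not stated in the surviving text. The helper name `hypergeom_dist` is likewise hypothetical.

```python
from fractions import Fraction
from math import comb

def hypergeom_dist(N, A, n):
    """Exact distribution of a, the number of class-C units in a simple
    random sample of n drawn from N units of which A are in class C."""
    total = comb(N, n)
    return {a: Fraction(comb(A, a) * comb(N - A, n - a), total)
            for a in range(n + 1)}

# Assumed values, inferred from the terms shown in the table above.
dist = hypergeom_dist(N=8, A=3, n=4)
mean = sum(a * pr for a, pr in dist.items())
variance = sum((a - mean) ** 2 * pr for a, pr in dist.items())
for a, pr in dist.items():
    print(a, pr)
print("mean =", mean, " variance =", variance)
```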


3.6 CONFIDENCE LIMITS 


We first discuss the meaning of confidence limits in the case of quali- 
tative characteristics. In the sample, a out of n fall in class C. Suppose that 
inferences are to be made about the number A in the population which 
falls in class C. For an upper confidence limit to A, we compute a value $\hat{A}_U$
such that for this value the probability of getting a or less falling in C in the
sample is some small quantity $\alpha_U$, for example, 0.025. Formally, $\hat{A}_U$
satisfies the equation

$$\sum_{j=0}^{a} \Pr(j, n-j \mid \hat{A}_U, N - \hat{A}_U) = \alpha_U \tag{3.15}$$

where Pr is the probability term for the hypergeometric distribution, as
defined in (3.14).

When $\alpha_U$ is chosen in advance, (3.15) requires in general a nonintegral
value of $\hat{A}_U$ to satisfy it, whereas conceptually $\hat{A}_U$ should be a whole
number. In practice we choose $\hat{A}_U$ as the smallest integral value of A such
that the left side of (3.15) is less than or equal to $\alpha_U$. Similarly, the lower
confidence limit $\hat{A}_L$ is the largest integral value such that

$$\sum_{j=a}^{n} \Pr(j, n-j \mid \hat{A}_L, N - \hat{A}_L) \le \alpha_L \tag{3.16}$$

Confidence limits for P are then found by taking $\hat{P}_U = \hat{A}_U/N$, $\hat{P}_L = \hat{A}_L/N$.
Numerous methods are available for computing confidence limits. 


Exact Methods 

Chung and DeLury (1950) present charts of the 90, 95, and 99% limits 
for P for N = 500, 2500, and 10,000. Values for intermediate population 
sizes are obtainable by interpolation. Lieberman and Owen (1961) give
tables of individual and cumulative terms of the hypergeometric
distribution, but N extends only to 100.
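
Where charts or tables are not at hand, the limits defined by (3.15) and (3.16) can be found by direct search. The sketch below is one possible implementation, not the method of the cited authors. The function names are hypothetical, and the convention of returning 0 for the lower limit when no unit of class C is observed is an assumption.

```python
from math import comb

def hypergeom_pmf(j, n, A, N):
    """Probability of j class-C units in a sample of n, when the
    population of N units contains A units in class C."""
    if j < 0 or j > A or n - j < 0 or n - j > N - A:
        return 0.0
    return comb(A, j) * comb(N - A, n - j) / comb(N, n)

def upper_limit(a, n, N, alpha_u):
    """Smallest integer A with Pr(a or fewer in C) <= alpha_u, as in (3.15)."""
    for A in range(a, N + 1):
        tail = sum(hypergeom_pmf(j, n, A, N) for j in range(a + 1))
        if tail <= alpha_u:
            return A
    return N  # no smaller value qualifies (for example when a == n)

def lower_limit(a, n, N, alpha_l):
    """Largest integer A with Pr(a or more in C) <= alpha_l, as in (3.16)."""
    best = 0
    for A in range(N + 1):
        tail = sum(hypergeom_pmf(j, n, A, N) for j in range(a, n + 1))
        if tail > alpha_l:
            break  # the upper tail only grows as A increases
        best = A
    return best

# Illustrative call: 37 of 100 sampled units in class C, population of 500.
A_L = lower_limit(37, 100, 500, 0.025)
A_U = upper_limit(37, 100, 500, 0.025)
print(A_L / 500, A_U / 500)
```

A linear search is used for clarity; since both tail probabilities are monotone in A, a bisection search would serve for very large N.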


The Normal Approximation 
From (3.8) for the variance of p, one form of the normal approximation 
to the confidence limits for P is 


$$p \pm \left[\, t\sqrt{1-f}\,\sqrt{pq/(n-1)} + \frac{1}{2n} \right] \tag{3.17}$$


where f = n/N and t is the normal deviate corresponding to the confidence
probability. For those who prefer it, use of the more familiar term $\sqrt{pq/n}$
seldom makes an appreciable difference. The last term on the right is a
correction for continuity. This produces only a slight improvement in the
approximation. However, without the correction, the normal
approximation usually gives too narrow a confidence interval.

The error in the normal approximation depends on all the quantities
n, p, N, $\alpha_U$, and $\alpha_L$. The quantity to which the error is most sensitive is np
or more specifically the number observed in the smaller class. Table 3.3
gives working rules for deciding when the normal approximation (3.17)
may be used.


TABLE 3.3
SMALLEST VALUES OF np FOR USE OF THE NORMAL
APPROXIMATION

          np = Number Observed      n =
p         in the Smaller Class      Sample Size

0.5               15                   30
0.4               20                   50
0.3               24                   80
0.2               40                  200
0.1               60                  600
0.05              70                 1400
~0*               80                    ∞

* This means that p is extremely small, so that np follows
the Poisson distribution.

The rules in Table 3.3 are constructed so that with 95% confidence
limits the true frequency with which the limits fail to enclose P is not
greater than 5.5%. Further, the probability that the upper limit is below
P is between 2.5 and 3.5%, and the probability that the lower limit exceeds
P is between 2.5 and 1.5%.


Example 1. In a simple random sample of size 100, from a population of size
500, there are 37 units in class C. Find the 95% confidence limits for the
proportion and for the total number in class C in the population. In this example


n = 100,  N = 500,  p = 0.37


The example lies in the range in which the normal approximation is
recommended. The estimated standard error of p is

$$\sqrt{(1-f)pq/(n-1)} = \sqrt{(0.8)(0.37)(0.63)/99} = 0.0434$$

The correction for continuity, 1/2n, equals 0.005. Hence the 95% limits for P
are estimated as

$$0.37 \pm (1.96 \times 0.0434 + 0.005) = 0.37 \pm 0.090$$

$$\hat{P}_L = 0.280, \qquad \hat{P}_U = 0.460$$

The limits as read from the charts by Chung and DeLury are 0.285 and 0.462,
respectively.


To find limits for the total number in class C in the population, we multiply
by N, obtaining 140 and 230, respectively.
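
The calculation in Example 1 can be written as a short function. This is a minimal sketch of (3.17); the function name `normal_limits` and the default t = 1.96 are choices made here for illustration.

```python
from math import sqrt

def normal_limits(n, N, a, t=1.96):
    """Normal-approximation confidence limits for P, following (3.17),
    including the fpc and the continuity correction 1/(2n)."""
    p = a / n
    q = 1.0 - p
    f = n / N
    half_width = t * sqrt((1.0 - f) * p * q / (n - 1)) + 1.0 / (2 * n)
    return p - half_width, p + half_width

p_low, p_high = normal_limits(n=100, N=500, a=37)
print(round(p_low, 3), round(p_high, 3))          # limits for P
print(round(500 * p_low), round(500 * p_high))    # limits for the total A
```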


Binomial Approximations 


When the normal approximation does not apply, limits for P may be
found from the binomial tables (section 3.4) and adjusted, if necessary, to
take account of the fpc. Table VIII1 in Fisher and Yates's Statistical
Tables (1957) gives binomial confidence limits for P for any value of n, and
is a useful alternative to the ordinary binomial tables. Example 2 shows
how the binomial approximation is computed.


Example 2. For another item in the sample in example 1, nine of the 100
units fall in class C. From Romig's table for n = 100 the 95% limits for P are
found to be 0.041 and 0.165. (The Fisher-Yates tables give 0.042 and 0.164.)
If f, the sampling fraction, is less than 5%, limits found in this way are close
enough for most purposes. In this example, f = 0.2 and adjustment is needed.

To apply the adjustment, we shorten the interval between p and each limit by
the factor $\sqrt{1-f} = \sqrt{0.8} = 0.894$. The adjusted limits are as follows:

$$\hat{P}_L = 0.090 - (0.894)(0.090 - 0.041) = 0.046$$

$$\hat{P}_U = 0.090 + (0.894)(0.165 - 0.090) = 0.157$$

The limits read from the charts by Chung and DeLury are 0.045 and 0.157,
respectively.
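
The fpc adjustment in Example 2 is a one-line rescaling of each half of the interval. A minimal sketch follows, taking the tabulated binomial limits as given inputs; the function name `shorten_for_fpc` is hypothetical.

```python
from math import sqrt

def shorten_for_fpc(p, lower, upper, f):
    """Shrink the distance from p to each binomial limit by sqrt(1 - f),
    where f = n/N is the sampling fraction."""
    k = sqrt(1.0 - f)
    return p - k * (p - lower), p + k * (upper - p)

# Binomial limits 0.041 and 0.165 are the tabulated values quoted in Example 2.
adj_low, adj_high = shorten_for_fpc(p=0.090, lower=0.041, upper=0.165, f=0.2)
print(round(adj_low, 3), round(adj_high, 3))
```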
Example 3. In auditing records in which a very low error rate is demanded,
the upper confidence limit for A is primarily of interest. Suppose that 200
of 1000 records are verified and that the batch of 1000 is accepted if no errors are
found. Special tables have been constructed to give the upper confidence limit
for the number of errors in the batch. A good approximation results from the
following relation. The probability that no errors are found in n when A errors
are present in N is, from the hypergeometric distribution,

$$\frac{(N-A)(N-A-1)\cdots(N-A-n+1)}{N(N-1)\cdots(N-n+1)} \approx \left(\frac{N-A-u}{N-u}\right)^{n}$$

where u = (n - 1)/2. For example, with n = 200, A = 10, N = 1000, the
approximation gives $(890.5/900.5)^{200}$, which is found by logs to be 0.107. Thus
A = 10 (a 1% error rate) is approximately the 90% upper confidence limit for
the number of errors in the batch.
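
The quality of this approximation is easy to examine numerically. The sketch below computes the exact hypergeometric probability of finding no errors alongside the approximation with u = (n - 1)/2; both function names are hypothetical.

```python
from math import comb

def prob_no_errors_exact(N, n, A):
    """Exact probability that a sample of n from N records contains none
    of the A records in error (hypergeometric)."""
    return comb(N - A, n) / comb(N, n)

def prob_no_errors_approx(N, n, A):
    """Approximation ((N - A - u) / (N - u))**n with u = (n - 1)/2."""
    u = (n - 1) / 2
    return ((N - A - u) / (N - u)) ** n

exact = prob_no_errors_exact(N=1000, n=200, A=10)
approx = prob_no_errors_approx(N=1000, n=200, A=10)
print(round(exact, 3), round(approx, 3))
```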


3.7 CLASSIFICATION INTO MORE THAN TWO CLASSES 


Frequently, in the presentation of results, the units are classified into 
more than two classes. Thus a sample from a human population may be 
arranged in 15 five-year age groups. Even when a question is supposed to 
be answered by a simple “yes” or “no,” the results actually obtained may 
fall into four classes: "yes," "no," “don’t know," and "no answer." 
The extension of the theory to such cases is illustrated by the situation in
which there are three classes.

We suppose that the number falling in the ith class is $A_i$ in the
population and $a_i$ in the sample, where

$$N = \sum A_i, \qquad n = \sum a_i, \qquad P_i = \frac{A_i}{N}, \qquad p_i = \frac{a_i}{n}$$

When the sample size n is small in relation to all the $A_i$, the probabilities
$P_i$ may be considered effectively constant throughout the drawing of the
sample. The probability of drawing the observed sample is given by the
multinomial expression

$$\Pr(a_i) = \frac{n!}{a_1!\,a_2!\,a_3!}\,P_1^{a_1} P_2^{a_2} P_3^{a_3} \tag{3.18}$$

This is the appropriate extension of the binomial distribution and is a good
approximation when the sampling fraction is small.

The correct expression for the probability of drawing the observed
sample is

$$\Pr(a_i \mid A_i) = \binom{A_1}{a_1}\binom{A_2}{a_2}\binom{A_3}{a_3} \Big/ \binom{N}{n} \tag{3.19}$$

This expression is the natural extension of (3.14), section 3.5, for the
hypergeometric distribution. The numerator is the number of distinct samples
of size n that can be formed with $a_1$ units in class 1, $a_2$ in class 2, and $a_3$ in
class 3.
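
To see how the multinomial approximation compares with the exact expression, both can be evaluated side by side. This is a sketch only: the class sizes used in the calls are hypothetical, and the function names are not from the text.

```python
from fractions import Fraction
from math import comb, factorial

def multinomial_prob(a, A):
    """Multinomial probability of sample counts a = (a1, a2, a3), with
    class probabilities P_i = A_i / N held constant."""
    N, n = sum(A), sum(a)
    coef = factorial(n)
    for a_i in a:
        coef //= factorial(a_i)
    prob = Fraction(coef)
    for a_i, A_i in zip(a, A):
        prob *= Fraction(A_i, N) ** a_i
    return prob

def hypergeom_prob(a, A):
    """Exact probability of sample counts a when sampling without
    replacement from class sizes A = (A1, A2, A3)."""
    numerator = 1
    for a_i, A_i in zip(a, A):
        numerator *= comb(A_i, a_i)
    return Fraction(numerator, comb(sum(A), sum(a)))

# Hypothetical class sizes: a tiny population, then a larger one.
print(multinomial_prob((1, 1, 1), (1, 2, 2)), hypergeom_prob((1, 1, 1), (1, 2, 2)))
print(float(multinomial_prob((1, 1, 1), (100, 200, 200))),
      float(hypergeom_prob((1, 1, 1), (100, 200, 200))))
```

With the small population the two probabilities differ noticeably; with the larger one, where the sampling fraction is small, they nearly coincide.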


3.8 CONFIDENCE LIMITS WHEN THERE ARE 
MORE THAN TWO CLASSES 


Two different cases must be distinguished. 
Case 1. We calculate

$$p = \frac{\text{number in any one class in sample}}{n} = \frac{a_1}{n}$$

or

$$p = \frac{\text{total number in a group of classes}}{n} = \frac{a_1 + a_2 + a_3}{n}$$

In either of these situations, although the original classification contains
more than two classes, p itself is obtained from a subdivision of the n
units into only two classes. The theory already presented applies to this
case. Confidence limits are calculated as described in section 3.6.

Case 2. Sometimes certain classes are omitted, p being computed from
a breakdown of the remaining classes into two parts. For example, we
might omit persons who did not know or gave no answer and consider the
ratio of number of "yes" answers to "yes" plus "no" answers. Ratios
that are structurally of this type are often of interest in sample surveys.
The denominator of such a ratio is not n but some smaller number n'.
Although n' varies from sample to sample, previous results can still be
used by considering the conditional distribution of p in samples in which
both n and n' are fixed. This device was already employed in section 2.10.
Suppose that

$$p = \frac{a_1}{a_1 + a_2}, \qquad n' = a_1 + a_2, \qquad n = a_1 + a_2 + a_3$$

so that $a_3$ is the number in the sample falling in classes in which we are not
at the moment interested. Then, as shown in the next section, the
conditional distribution of $a_1$ and $a_2$ is the hypergeometric distribution obtained
when the sample is of size n' and the population of size $N' = A_1 + A_2$.
Hence, from (3.17), the normal approximation to conditional confidence
limits for $P = A_1/(A_1 + A_2)$ is

$$p \pm \left[\, t\sqrt{1 - \frac{n'}{N'}}\,\sqrt{\frac{pq}{n'-1}} + \frac{1}{2n'} \right] \tag{3.20}$$

If the value of N' is not known, n/N may be substituted for n'/N' in the
fpc term in (3.20).


3.9 THE CONDITIONAL DISTRIBUTION OF p 


To find this distribution, we restrict our attention to samples of size n in
which $n' = a_1 + a_2$ fall in classes 1 and 2. The number of distinct samples
of this type is

$$\binom{N'}{n'}\binom{N-N'}{n-n'} = \binom{A_1 + A_2}{a_1 + a_2}\binom{A_3}{a_3} \tag{3.21}$$

Among these samples, the number which have $a_1$ in class 1 and $a_2$ in class
2 has already been given as the numerator in (3.19), section 3.7. Dividing
this numerator by (3.21), we have

$$\Pr(a_i \mid A_1, A_2, n, n') = \binom{A_1}{a_1}\binom{A_2}{a_2} \Big/ \binom{A_1 + A_2}{a_1 + a_2} \tag{3.22}$$

This is an ordinary hypergeometric distribution for a sample of size n'
from a population of size $N' = A_1 + A_2$.


Example. Consider a population that consists of the five units b, c, d, e, f,
which fall in three classes.

Class    A_i    Units denoted by
1        1      b
2        2      c, d
3        2      e, f

With random samples of size 3, we wish to estimate $P = A_1/(A_1 + A_2)$, or
in this case 1/3. Thus N = 5 and N' = 3.
There are 10 possible samples of size 3, all with equal initial probabilities.
These are grouped according to the value of n'.

n' = 1
                                   Conditional
Sample         a_1    a_2    p     Probability    (p - P)
bef            1      0      1     1/3            2/3
cef or def     0      1      0     2/3            -1/3

If samples are specified by the values of $a_1$, $a_2$, only two types are obtainable:
$a_1 = 1$, $a_2 = 0$; $a_1 = 0$, $a_2 = 1$. Their conditional probabilities, 1/3 and 2/3,
respectively, agree with the general expression (3.22). Further,

$$E(p) = \frac{1}{3}$$

$$\sigma_p^2 = \left(\frac{1}{3}\right)\left(\frac{4}{9}\right) + \left(\frac{2}{3}\right)\left(\frac{1}{9}\right) = \frac{2}{9}$$

The estimate p is unbiased, and its variance agrees with the general formula

$$\sigma_p^2 = \left(\frac{N'-n'}{N'-1}\right)\frac{PQ}{n'} = \left(\frac{3-1}{3-1}\right)\left(\frac{1}{3}\right)\left(\frac{2}{3}\right) = \frac{2}{9}$$

For n' = 2 there are six possible samples, which give only two sets of values
of $a_1$, $a_2$.

n' = 2
                                        Conditional
Sample                  a_1    a_2    p      Probability    (p - P)
bce, bcf, bde, or bdf   1      1      1/2    2/3            1/6
cde or cdf              0      2      0      1/3            -1/3

The estimate is again unbiased and its variance is

$$\sigma_p^2 = \left(\frac{2}{3}\right)\left(\frac{1}{36}\right) + \left(\frac{1}{3}\right)\left(\frac{1}{9}\right) = \frac{1}{18}$$

which may be verified from the general formula. Note that the variance is only
one fourth of that obtained when n' = 1. In a conditional approach the variance
changes with the configuration of the sample that was drawn.

For n' = 3, there is only one possible sample, bcd. This gives the correct
population fraction, 1/3. The conditional variance of p is zero, as indicated by the
general formula, which reduces to zero when N' = n'.
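
The conditional results in this example can be confirmed by listing all 10 samples. The sketch below does so with exact fractions; the dictionary `classes` encodes the five units of the example, and the function name `conditional_moments` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Units b..f of the example and the class each belongs to.
classes = {"b": 1, "c": 2, "d": 2, "e": 3, "f": 3}
P = Fraction(1, 3)  # A1 / (A1 + A2) for this population

def conditional_moments(n_prime):
    """Number of samples of size 3 with n' units in classes 1 and 2,
    together with the conditional mean of p and its variance about P."""
    values = []
    for sample in combinations(classes, 3):
        a1 = sum(classes[u] == 1 for u in sample)
        a2 = sum(classes[u] == 2 for u in sample)
        if a1 + a2 == n_prime:
            values.append(Fraction(a1, a1 + a2))
    mean = sum(values) / len(values)
    variance = sum((p - P) ** 2 for p in values) / len(values)
    return len(values), mean, variance

for n_prime in (1, 2, 3):
    print(n_prime, conditional_moments(n_prime))
```

Only n' = 1, 2, 3 can occur here, since class 3 holds just two units.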


3.10 PROPORTIONS AND TOTALS OVER 
SUBPOPULATIONS 


7^ are applicable. The sample data may be 
presented as follows: 
                   Domain 1       Domain 2       ...   Domain k       Total
Class              C     C'       C     C'             C     C'
Number of units    a_1   a_1'     a_2   a_2'     ...   a_k   a_k'     n


Of the n units, $(a_1 + a_1')$ are found to fall in domain 1 and of these $a_1$
fall in class C. The proportion falling in class C in domain 1 is estimated
by $p_1 = a_1/(a_1 + a_1')$. The frequency distribution and confidence limits
for $p_1$ were discussed under Case 2 in sections 3.8 and 3.9.

For estimating the total number $A_1$ of units in class C in domain 1,
there are two possibilities. If $N_1$, the total number of units in domain 1 in
the population, is known, we may use the conditional estimate

$$\hat{A}_1 = N_1 p_1 = \frac{N_1 a_1}{a_1 + a_1'}$$

Its standard error is computed as

$$s(\hat{A}_1) = N_1\sqrt{1 - (n_1/N_1)}\,\sqrt{p_1 q_1/(n_1 - 1)}$$

where $n_1 = a_1 + a_1'$.
If $N_1$ is not known, the estimate is

$$\hat{A}_1' = \frac{N a_1}{n}$$

with estimated standard error

$$s(\hat{A}_1') = N\sqrt{1 - (n/N)}\,\sqrt{pq/(n-1)}$$

where $p = a_1/n$.
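
The two estimates of a domain total can be compared numerically. The sketch below is illustrative only: all the counts in the calls are hypothetical, and the function names are not from the text.

```python
from math import sqrt

def domain_total_known_N1(N1, a1, a1_prime):
    """Estimate of A1 and its standard error when the domain size N1 is known."""
    n1 = a1 + a1_prime
    p1 = a1 / n1
    estimate = N1 * p1
    se = N1 * sqrt(1 - n1 / N1) * sqrt(p1 * (1 - p1) / (n1 - 1))
    return estimate, se

def domain_total_unknown_N1(N, n, a1):
    """Estimate of A1 and its standard error when N1 is not known."""
    p = a1 / n
    estimate = N * p
    se = N * sqrt(1 - n / N) * sqrt(p * (1 - p) / (n - 1))
    return estimate, se

# Hypothetical survey: N = 1000, n = 100; in domain 1, 30 units in C and 20 not.
print(domain_total_known_N1(N1=400, a1=30, a1_prime=20))
print(domain_total_unknown_N1(N=1000, n=100, a1=30))
```

In this made-up case the standard error is considerably smaller when N_1 is known, which is the comparison taken up in exercise 3.8.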


3.11 COMPARISONS BETWEEN DIFFERENT DOMAINS 


Since proportions are estimated independently in different domains, 
comparisons between such proportions are made by standard elementary 
methods. For example, to test whether the proportion $p_1 = a_1/(a_1 + a_1')$
differs significantly from the proportion $p_2 = a_2/(a_2 + a_2')$, we form the
usual 2 x 2 table

            Domain
            1        2
C           a_1      a_2
C'          a_1'     a_2'
Total       n_1      n_2


The ordinary χ² test (Fisher, 1958) or the normal approximation to the
distribution of $(p_1 - p_2)$ is appropriate. Similarly, comparisons among
proportions for more than two domains are made by the methods for a
2 x k contingency table.

Occasionally it is desired to test whether $a_1$ differs significantly from $a_2$;
for example, whether the number of Republicans who favor some proposal
is greater than the number of Democrats in favor. On the null hypothesis
that these two numbers are equal in the population, the total $n' = a_1 + a_2$
in the two classes in question should divide with equal probability between
the two classes. Consequently we may regard $a_1$ as a binomial number of
successes in n' trials, with probability of success 1/2 on the null hypothesis.
It may be verified that the normal deviate (corrected for continuity) is

$$\frac{|a_1 - a_2| - 1}{\sqrt{n'}}$$
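
The normal deviate for comparing two counts follows from the binomial mean n'/2 and standard deviation √n'/2. A minimal sketch, with hypothetical counts and a hypothetical function name:

```python
from math import sqrt

def equal_counts_deviate(a1, a2):
    """Continuity-corrected normal deviate for testing whether two counts
    differ, treating a1 as binomial with n' = a1 + a2 trials and
    success probability 1/2 under the null hypothesis."""
    n_prime = a1 + a2
    return (abs(a1 - a2) - 1) / sqrt(n_prime)

# Hypothetical counts: 60 in favor in one group, 40 in the other.
print(equal_counts_deviate(60, 40))
```

The same value is obtained from the binomial form (|a_1 - n'/2| - 1/2) / (√n'/2), which is the step left to the reader in the text.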
3.12 ESTIMATION OF PROPORTIONS IN CLUSTER
SAMPLING


As mentioned in section 3.2, the preceding methods are not valid if each
unit is a cluster of elements and we are estimating the proportion of
elements that fall into class C.

If each unit contains the same number m of elements, let $p_i = a_i/m$ be
the proportion of elements in the ith unit that fall into class C. The
proportion falling in C in the sample is

$$p = \frac{\sum a_i}{nm} = \frac{1}{n}\sum_{i=1}^{n} p_i$$

that is, the estimate p is the unweighted mean of the quantities $p_i$.
Consequently, if $y_i$ is replaced by $p_i$, the formulas in Chapter 2 may be applied
directly to give the true and estimated variance of p.

$$V(p) = \frac{1-f}{n}\,\frac{\sum_{i=1}^{N}(p_i - P)^2}{N-1} \tag{3.23}$$

An unbiased sample estimate of this variance is

$$v(p) = \frac{1-f}{n}\,\frac{\sum_{i=1}^{n}(p_i - p)^2}{n-1} \tag{3.24}$$

Example 1. A group of 61 leprosy patients were treated with a drug for 48
weeks. To measure the effect of the drug on the leprosy bacilli, the presence of
bacilli at six sites on the body of each patient was tested bacteriologically. Among
the 366 sites, 153, or 41.8%, were negative. What is the standard error of this
percentage?

This example comes from a controlled experiment rather than a survey, but
it illustrates how erroneous the binomial formula may be. By the binomial
formula we have n = 366 and

$$\text{s.e.}(p) = \sqrt{pq/(n-1)} = \sqrt{(41.8)(58.2)/365} = 2.58\%$$

Each patient is a cluster unit with m = 6 elements (sites). To find the standard
error by the correct formula, we need the frequency distribution of the 61 values
of $p_i$. It is more convenient to tabulate the distribution of $y_i$, the number of
negative sites per patient. With $p_i$ expressed in per cents, $p_i = 100y_i/6$. From
the distribution in Table 3.4 we find $\sum f y^2 = 669$ and

$$\text{s.e.}(\bar{y}) = \sqrt{\frac{\sum f y^2 - (\sum f y)^2/n}{n(n-1)}} = \sqrt{\frac{669 - (153)^2/61}{(61)(60)}} = 0.279$$

Hence

$$\text{s.e.}(p) = \frac{100}{6}\,\text{s.e.}(\bar{y}) = 4.65\%$$

This figure is about 1.8 times the value given by the binomial formula. The
binomial formula requires the assumption that results at different sites on the
same patient are independent, although actually they have a strong positive
correlation. The last line of Table 3.4 shows the expected numbers of patients
with 0, 1, 2, ... negative sites, computed from the binomial $(0.58 + 0.42)^6$.


TABLE 3.4
NUMBER OF NEGATIVE SITES PER PATIENT

y_i = 6p_i/100     0      1      2      3      4      5      6    Total
f                 17     11      4      4      7     14      4     61
f y                0     11      8     12     28     70     24    153
f_exp            2.3   10.1   18.3   17.6    9.6    2.8    0.3   61.0


Note the marked excesses of observed frequencies f of patients with zero negatives
and with five and six negatives.
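
The contrast between the binomial and the cluster standard errors in this example can be recomputed from the frequency distribution. In the sketch below the frequencies f are an assumption in one respect: they are obtained by dividing the legible f·y row of Table 3.4 (0, 11, 8, 12, 28, 70, 24) by y, with the zero class filled in so that the total is 61 patients. Variable names are hypothetical.

```python
from math import sqrt

m = 6                                   # sites (elements) per patient
y_values = [0, 1, 2, 3, 4, 5, 6]        # number of negative sites
freqs = [17, 11, 4, 4, 7, 14, 4]        # inferred from the f*y row and total 61

n = sum(freqs)                                           # patients (clusters)
sum_fy = sum(f * y for f, y in zip(freqs, y_values))     # negative sites in all
sum_fy2 = sum(f * y * y for f, y in zip(freqs, y_values))

# Cluster formula: standard error of the mean count, rescaled to per cents.
se_ybar = sqrt((sum_fy2 - sum_fy ** 2 / n) / (n * (n - 1)))
se_cluster = 100 / m * se_ybar

# Binomial formula, wrongly treating the 366 sites as independent.
p = 100 * sum_fy / (n * m)
se_binomial = sqrt(p * (100 - p) / (n * m - 1))

print(round(se_cluster, 2), round(se_binomial, 2))
```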

If the size of cluster is not constant, let $m_i$ be the number of elements in
the ith cluster unit and let $p_i = a_i/m_i$. The proportion of elements falling in
class C in the sample is

$$p = \frac{\sum a_i}{\sum m_i}$$

Structurally, this is a typical ratio estimate, discussed in section 2.9 and
later in Chapter 6. It is slightly biased, though the bias is seldom likely to
be of practical importance.

If we put $a_i$ for $y_i$ and $m_i$ for $x_i$ in (2.29), the approximate variance of p is

$$V(p) \approx \frac{1-f}{n\bar{M}^2}\,\frac{\sum_{i=1}^{N}(a_i - Pm_i)^2}{N-1} \tag{3.25}$$

where P is the proportion of elements in C in the population and
$\bar{M} = \sum m_i/N$ is the average number of elements per cluster. An alternative
expression is

$$V(p) \approx \frac{1-f}{n}\,\frac{\sum_{i=1}^{N}(m_i/\bar{M})^2 (p_i - P)^2}{N-1}$$

This form shows that the approximate variance involves a weighted sum
of squares of deviations of the $p_i$ from the population value P.
For the estimated variance we have

$$v(p) = \frac{1-f}{n\bar{m}^2}\,\frac{\sum a_i^2 - 2p\sum a_i m_i + p^2\sum m_i^2}{n-1} \tag{3.26}$$

where $\bar{m} = \sum m_i/n$ is the average number of elements per cluster in the
sample.
Example 2. A simple random sample of 30 households was drawn from a
census taken in 1947 in wards 6 and 7 of the Eastern Health District of Baltimore.
The population contains about 15,000 households. In Table 3.5 the persons in
each household are classified (a) according to whether they had consulted a
doctor in the last 12 months, (b) according to sex.


TABLE 3.5
DATA FOR A SIMPLE RANDOM SAMPLE OF 30 HOUSEHOLDS

Columns: Household Number; Number of Persons; Number of Males ($a_i$);
Number of Females; Doctor Seen in Last Year: Yes ($a_i$), No.

[Table body illegible in this copy; the only legible totals are 30 "Yes" and
74 "No" for doctor seen in last year.]


Our purpose is to contrast the ratio formula with the inappropriate binomial
formula. Consider first the proportion of people who had consulted a doctor.
For the binomial formula, we would take

n = 104,  p = 30/104 = 0.2885

Hence

$$v_{bin}(p) = \frac{pq}{n} = \frac{(0.2885)(0.7115)}{104} = 0.00197$$

For the ratio formula, we note that there are 30 clusters and take

n = 30
$m_i$ = total number in ith household
$a_i$ = number in ith household who had seen a doctor

p = 0.2885, as before
$\bar{m}$ = 104/30 = 3.4667
$\sum a_i^2 = 86$;  $\sum m_i^2 = 404$;  $\sum a_i m_i = 113$

The fpc may be ignored. Hence, from (3.26),

$$v(p) = \frac{(86) - 2(0.2885)(113) + (0.2885)^2(404)}{(30)(29)(3.4667)^2} = 0.00520$$

The variance given by the ratio method, 0.00520, is much larger than that given
by the binomial formula, 0.00197. For various reasons, families differ in the
frequency with which their members consult a doctor. For the sample as a whole,
the proportion who consult a doctor is only a little more than one in four, but
there are several families in which every member has seen a doctor. Similar
results would be obtained for any characteristic in which the members of the same
family tend to act in the same way.

In estimating the proportion of males in the population, the results are different.
By the same type of calculation, we find

binomial formula: v(p) = 0.00240
ratio formula: v(p) = 0.00114

Here the binomial formula overestimates the variance. The reason is interesting.
Most households are set up as a result of a marriage, hence contain at least one
male and one female. Consequently the proportion of males per family varies
less from 1/2 than would be expected from the binomial formula. None of the 30
families, except one with only one member, is composed entirely of males, or
entirely of females. If the binomial distribution were applicable, with a true P of
approximately 1/2, households with all members of the same sex would constitute
one quarter of the households of size 3 and one eighth of the households of size 4.
This property of the sex ratio has been discussed by Hansen and Hurwitz (1942).
Other illustrations of the error committed by improper use of the binomial
formula in sociological investigations have been given by Kish (1957).
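
The ratio-formula calculation in Example 2 needs only the sample sums quoted there. The sketch below evaluates (3.26) from those sums; the function name `ratio_variance` and its argument names are hypothetical, and the fpc is ignored by default as in the example.

```python
def ratio_variance(n, sum_a, sum_m, sum_a2, sum_am, sum_m2, f=0.0):
    """Estimated variance (3.26) of p = sum(a_i) / sum(m_i) for clusters
    of unequal size, computed from sample sums. f is the sampling fraction."""
    p = sum_a / sum_m
    m_bar = sum_m / n
    numerator = sum_a2 - 2 * p * sum_am + p ** 2 * sum_m2
    return (1 - f) * numerator / (n * (n - 1) * m_bar ** 2)

# Sums quoted in Example 2 for "doctor seen in last year".
v_ratio = ratio_variance(n=30, sum_a=30, sum_m=104,
                         sum_a2=86, sum_am=113, sum_m2=404)

# Binomial formula, treating the 104 persons as independent draws.
p = 30 / 104
v_binomial = p * (1 - p) / 104

print(round(v_ratio, 5), round(v_binomial, 5))
```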
EXERCISES


3.1 For a population with N = 6, A = 4, A' = 2, work out the value of
a for all possible simple random samples of size 3. Verify the theorems given for
the mean and variance of p = a/n. Verify that

$$\frac{N-n}{(n-1)N}\,pq$$

is an unbiased estimate of the variance of p.


3.2 In a simple random sample of 200 from a population of 2000 colleges,
120 colleges were in favor of a proposal, 57 were opposed, and 23 had no opinion.
Estimate 95% confidence limits for the number of colleges in the population that
favored the proposal.

3.3 Do the results of the previous sample furnish conclusive evidence that
the majority of the colleges in the population favored this proposal?

3.4 A population with N = 7 consists of the elements $B_1, C_1, C_2, C_3, D_1, D_2$,
and $D_3$. A simple random sample of size 4 is taken in order to estimate the
proportion of C's to C's + D's. Work out the conditional distributions of this
proportion, p, and verify the formula for its conditional variance.

3.5 In the preceding exercise, what is the probability that a sample of size
4 contains $B_1$? Find the average variance of p over all simple random samples of
size 4 and verify that this is 11/280.

3.6 A simple random sample of 290 households was chosen from a city area
containing 14,828 households. Each family was asked whether it owned or
rented the house and also whether it had the exclusive use of an indoor toilet.
Results were as follows:

                            Owned          Rented         Total
Exclusive use of toilet     Yes    No      Yes    No
                            141     6      109    34       290

(a) For families who rent, estimate the percentage in the area with exclusive
use of an indoor toilet and give the standard error of your estimate; (b) estimate
the total number of renting families in the area who do not have exclusive indoor
toilet facilities and give the standard error of this estimate.


3.7 If, in exercise 3.6, the total number of renting families in the city area
is 7526, make a new estimate of the number of renters without exclusive toilet
facilities and give the standard error of this estimate.


3.8 For estimating the total number of units in class C in domain 1 (section
3.10), the estimate $\hat{A}_1 = N_1 p_1$ was recommended if $N_1$ were known, as against
$\hat{A}_1' = Na_1/n$ if $N_1$ were not known. Ignoring the fpc, show that in large samples
the ratio of the variance of $\hat{A}_1$ to that of $\hat{A}_1'$ is approximately $Q_1/(Q_1 + P_1\pi)$,
where $\pi$ is the proportion of the population that is not in domain 1, and $P_1$, as
in section 3.10, is the proportion of the units in domain 1 that fall in class C.
State the conditions under which knowledge of $N_1$ produces large reductions in
variance.


3.9 In a simple random sample of size 5 from a population of size 30, no
units in the sample were in class C. By the hypergeometric distribution, find the
upper limit to the number A of units in class C in the population, corresponding
to a one-tailed confidence probability of 95%. Find also the approximation to
$A_U$ obtained by computing the upper 95% binomial limit $P_U$ and shortening the
interval as described in section 3.6. Try also the method on p. 59.

3.10 A student health service has a record of the total number of eligible
students N and of the total number of visits Y made by students during a year.
Some students made no visits. The service wishes to estimate the mean number
of visits $Y/N_1$ for the $N_1$ students who made at least one visit, but does not
know the value of $N_1$. A simple random sample of n eligible students is taken.
In it $n_1$ students out of the n made at least one visit and their total number of
visits was y. Ignore the fpc in this question. (a) Show that $y/n_1$ is an unbiased
estimate of $Y/N_1$ and that its conditional variance is $S^2/n_1$, where $S^2$ is the
variance of the number of visits among students making at least one visit.
(b) A second method of estimating $Y/N_1$ is to use $\hat{N}_1 = Nn_1/n$ as an estimate
of $N_1$ and hence $Yn/Nn_1$ as an estimate of $Y/N_1$. Show that this estimate is
biased and that the ratio of the bias to the true value $Y/N_1$ is approximately
$(N - N_1)/nN_1$. Find an approximate expression for the variance of the estimate
$Yn/Nn_1$ and show that the estimate in (a) has a higher variance if

$$S^2 > \frac{N - N_1}{N}\left(\frac{Y}{N_1}\right)^2$$

Hint. If p is a binomial estimate of P, based on n trials, then approximately

$$E\left(\frac{1}{p}\right) \approx \frac{1}{P} + \frac{Q}{nP^2}, \qquad V\left(\frac{1}{p}\right) \approx \frac{Q}{nP^3}$$

3.11 Which of the two previous estimates seems more precise in the following
circumstances? N = 2004, Y = 3011. The sample with n = 100 showed that
73 students made at least one visit. Their total number of visits was 152 and the
estimated variance $s^2$ was 1.55.

3.12 A simple random sample of n cluster units, each with m elements, is
taken from a population in which the proportion of elements in class C is P.
As the intracluster correlation varies, what are the highest and lowest possible
values of the true variance of p (the sample estimate of P) and how do they
compare with the binomial variance? Ignore the fpc.

3.13 For the sample of 30 households in Table 3.5, the data shown (p. 70)
refer to visits to the dentist in the last year. Estimate the variance of the
proportion of persons who saw a dentist, and compare this with the binomial
estimate of the variance.
Number of    Dentist Seen        Number of    Dentist Seen
Persons      Yes     No          Persons      Yes     No

    5         1       4              5         1       4
    6         0       6              4         4       0
    3         1       2              4         1       3
    3         2       1              3         1       2
    2         0       2              3         0       3
    3         0       3              4         1       3
    3         1       2              3         0       3
    3         1       2              3         1       2
    4         1       3              1         0       1
    4         0       4              2         0       2
    3         1       2              4         0       4
    2         0       2              3         1       2
    7         2       5              4         1       3
    4         1       3              2         0       2
    3         0       3              4         0       4

3.14 In sampling for a rare attribute, one method is to continue drawing a
simple random sample until m units that possess the rare attribute have been
found (Haldane, 1945) where m is chosen in advance. If the fpc is ignored, prove
that the probability that the total sample required is of size n is

$$\frac{(n-1)!}{(m-1)!\,(n-m)!}\, P^m Q^{n-m} \qquad (n \ge m)$$

where P is the frequency of the rare attribute. Find the average size of the total
sample and show that $p = (m - 1)/(n - 1)$ is an unbiased estimate of P. (For
further discussion, see Finney, 1949, and Sandelius, 1951, who considers a plan
in which sampling continues until either m has been found or the total sample
size has reached a preassigned limit $n_0$.)
CHAPTER 4


The Estimation of Sample Size 


4.1 A HYPOTHETICAL EXAMPLE 


In the planning of a sample survey, a stage is always reached at which a 
decision must be made about the size of the sample. The decision is impor- 
tant. Too large a sample implies a waste of resources, and too small a 
sample diminishes the utility of the results. The decision cannot always be 
made satisfactorily, for often we do not possess enough information to be 
sure that our choice of sample size is the best one. Sampling theory pro- 
vides a framework within which to think intelligently about the problem. 

A hypothetical example brings out the steps involved in reaching a solu- 
tion. An anthropologist is preparing to study the inhabitants of some 
island. Among other things, he wishes to estimate the percentage of
inhabitants belonging to blood group O. Cooperation has been secured so that
it is feasible to take a simple random sample. How large should the sample
be?

This question cannot be discussed without first receiving an answer to 
another question. How accurately does the anthropologist wish to know 
the percentage of people with blood group O? In reply he states that he 
will be content if the percentage is correct within ±5% in the sense that,
if the sample shows 43% to have blood group O, the percentage for the
whole island is sure to lie between 38 and 48.

To avoid misunderstanding, it may be advisable to point out to the 
anthropologist that we cannot absolutely guarantee accuracy within 5% 
except by measuring everyone. However large n is taken, there is a chance 
of a very unlucky sample that is in error by more than the desired 5%. 
The anthropologist replies coldly that he is aware of this, that he is willing 
to take a 1 in 20 chance of getting an unlucky sample, and that all he asks 
for is the value of n instead of a lecture on statistics. 

We are now in a position to make a rough estimate of n. To simplify 
matters, the fpc is ignored, and the sample percentage p is assumed to be
normally distributed. Whether these assumptions are reasonable can be
verified when the initial n is known.

In technical terms, p is to lie in the range (P ± 5), except for a 1 in 20
chance. Since p is assumed normally distributed about P, it will lie in the
range $(P \pm 2\sigma_p)$, apart from a 1 in 20 chance. Further,

$$\sigma_p = \sqrt{PQ/n}$$

Hence, we may put

$$2\sqrt{PQ/n} = 5 \quad \text{or} \quad n = \frac{4PQ}{25}$$

At this point a difficulty appears that is common to all problems in the
estimation of sample size. A formula for n has been obtained, but n
depends on some property of the population that is to be sampled. In this
instance the property is the quantity P which we would like to measure.
We therefore ask the anthropologist if he can give us some idea of the likely
value of P. He replies that from previous data on other ethnic groups, and
from his speculations about the racial history of this island, he will be
surprised if P lies outside the range 30 to 60%.


This information is sufficient to provide a usable answer. For any value
of P between 30 and 60, the product PQ lies between 2100 and a maximum
of 2500 at P = 50. The corresponding n lies between 336 and 400. To be
on the safe side, 400 is taken as the initial estimate of n.
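
The arithmetic of this rough estimate can be checked with a short sketch. The function name and the 5-point grid over the range of P are illustrative assumptions; the formula is the one just derived, n = 4PQ/25, with P, Q, and d in percentages and the fpc ignored.

```python
def n_for_percentage(P, d=5.0, t=2.0):
    """Rough sample size for estimating a percentage P within +/- d
    percentage points, ignoring the fpc and assuming p is normal."""
    Q = 100.0 - P
    return t * t * P * Q / (d * d)

# Scan the anthropologist's plausible range for P (30% to 60%).
sizes = {P: n_for_percentage(P) for P in range(30, 61, 5)}
n_low, n_high = min(sizes.values()), max(sizes.values())
print(round(n_low), round(n_high))  # 336 400
```

The largest value over the range occurs at P = 50, which is why 400 is the conservative choice.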


The assumptions made in this analysis can now be re-examined. With 
n = 400 and a P between 30 and 60, the distribution of p should be close 
to normal. Whether the fpc is required depends on the number of people 
on the island. If the population exceeds 8000, the sampling fraction is less 
than 5% and no adjustment for fpc is called for. The method of applying 
the readjustment, if it is needed, is discussed in section 4.4. 


4.2 ANALYSIS OF THE PROBLEM 


The principal steps involved in the choice of a sample size are as follows: 


1. There must be some statement concerning what is expected of the 
sample. This statement may be in terms of desired limits of error, as in the 
previous example, or in terms of some decision that is to be made or action 
that is to be taken when the sample results are known. The responsibility 
for framing the statement rests primarily with the persons who wish to use 
the results of the survey, though they frequently need guidance in putting 
their wishes into numerical terms.

2. Some equation which connects n with the desired precision of the
sample must be found. The equation will vary with the content of the
statement of precision and with the kind of sampling that is contemplated.
One of the advantages of probability sampling is that it enables this equa- 
tion to be constructed. 

3. This equation will contain, as parameters, certain unknown prop- 
erties of the population. These must be estimated in order to give specific 
results. 

4. It often happens that data are to be published for certain major sub- 
divisions of the population and that desired limits of error are set up for 
each subdivision. A separate calculation is made for the n in each sub- 
division, and the total n is found by addition. 

5. More than one item or characteristic is usually measured in a sample 
survey: sometimes the number of items is large. If a desired degree of 
precision is prescribed for each item, the calculations lead to a series of 
conflicting values of n, one for each item. Some method must be found for 
reconciling these values. 

6. Finally, the chosen value of n must be appraised to see whether it is
consistent with the resources available to take the sample. This demands an
estimation of the cost, labor, time, and materials required to obtain the
proposed size of sample. It sometimes becomes apparent that n will have
to be drastically reduced. A hard decision must then be faced: whether to
proceed with a much smaller sample size, thus reducing precision, or to
abandon efforts until more resources can be found.


In succeeding sections some of these questions are examined in more
detail.


4.3 THE SPECIFICATION OF PRECISION

The statement of precision desired may be made by giving the amount of
error that we are willing to tolerate in the sample estimates. This amount is
determined, as best we can, in the light of the uses to which the sample
results are to be put. Sometimes it is difficult to decide how much error
should be tolerated, particularly when the results have several different uses.
Suppose that we asked the anthropologist why he wished the percentage
with blood group O to be correct to 5% rather than, say, 4 or 6%. He
might reply that the blood group data are to be used primarily for racial
classification. He strongly suspects that the islanders belong either to a
racial type with a P of about 35% or to one with a P of about 50%. An
error limit of 5% in the estimate seemed to him small enough to permit
classification into one of these types. He would, however, have no violent
objection to 4 or 6% limits of error.

Thus the choice of a 5% limit of error by the anthropologist was to some
extent arbitrary. In this respect the example is typical of the way in which a
limit of error is often decided on. In fact, the anthropologist was more
certain of what he wanted than many other scientists and administrators
will be found to be. When the question of desired degree of precision is 
first raised, such persons may confess that they have never thought about it 
and have no idea of the answer. My experience has been, however, that 
after discussion they can frequently indicate at least roughly the size of a 
limit of error that appears reasonable to them. 

Further than this we may not be able to go in many practical situations. 
Part of the difficulty is that not enough is known about the consequences of 
errors of different sizes as they affect the wisdom of practical decisions that 
are made from survey results. Even when these consequences are known,
however, the results of many important surveys are used by different people
for different purposes, and some of the purposes are not foreseen at the
time when the survey is planned. Therefore, an element of guesswork is
likely to be prominent in the specification of precision for some time to
come.


If the sample is taken for a very specific purpose, for example, for making
a single “yes” or “no” decision or for deciding how much money to spend 
on a certain venture, the precision needed can usually be stated in a more 
definite manner, in terms of the consequences of errors in the decision. A 
general approach to problems of this type is given in section 4.9, which, 
although in need of amplification, offers a logical start on a solution. 


4.4 THE FORMULA FOR n IN SAMPLING FOR
PROPORTIONS


The units are classified into two classes, C and C'. Some margin of error
d in the estimated proportion p of units in class C has been agreed on, and
there is a small risk $\alpha$ which we are willing to incur that the actual error is
larger than d; that is, we want

$$\Pr(|p - P| \ge d) = \alpha$$

Simple random sampling is assumed, and p is taken as normally distributed.
From theorem 3.2, section 3.2,

$$\sigma_p = \sqrt{\frac{N - n}{N - 1}}\,\sqrt{\frac{PQ}{n}}$$

Hence the formula that connects n with the desired degree of precision is

$$d = t\,\sqrt{\frac{N - n}{N - 1}}\,\sqrt{\frac{PQ}{n}}$$

where t is the abscissa of the normal curve that cuts off an area $\alpha$ at the tails.
Solving for n, we find

$$n = \frac{\dfrac{t^2 PQ}{d^2}}{1 + \dfrac{1}{N}\left(\dfrac{t^2 PQ}{d^2} - 1\right)} \qquad (4.1)$$

For practical use, an advance estimate p of P is substituted in this formula.
If N is large, a first approximation is

$$n_0 = \frac{t^2 pq}{d^2} = \frac{pq}{V} \qquad (4.2)$$

where

$$V = \frac{d^2}{t^2} = \text{desired variance of the sample proportion}$$

In practice we first calculate $n_0$. If $n_0/N$ is negligible, $n_0$ is a satisfactory
approximation to the n of (4.1). If not, it is apparent on comparison of
(4.1) and (4.2) that n is obtained as

$$n = \frac{n_0}{1 + (n_0 - 1)/N} \approx \frac{n_0}{1 + (n_0/N)} \qquad (4.3)$$

Example. In the hypothetical blood groups example we had

$$d = 0.05, \quad p = 0.5, \quad \alpha = 0.05, \quad t = 2$$

Thus

$$n_0 = \frac{(4)(0.5)(0.5)}{(0.0025)} = 400$$

Let us assume that there are only 3200 people on the island. The fpc is needed,
and we find

$$n = \frac{n_0}{1 + (n_0 - 1)/N} = \frac{400}{1 + \frac{399}{3200}} = 356$$

The formula for $n_0$ holds also if d, p, and q are all expressed as percentages
instead of proportions. Since the product pq increases as p moves toward 1/2, or
50%, a conservative estimate of n is obtained by choosing for p the value nearest
to 1/2 in the range in which p is thought likely to lie. If p seems likely to lie between
5 and 9%, for instance, we assume 9% for the estimation of n.
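
As a check on the worked example, the two-stage calculation (first approximation, then adjustment for the fpc) might be sketched as follows. The function names are illustrative assumptions; the formulas are (4.2) and the first form of (4.3), with p and d as proportions.

```python
def n0_proportion(p, d, t=2.0):
    """First approximation (4.2): n0 = t^2 p q / d^2, for large N."""
    return t * t * p * (1.0 - p) / (d * d)

def n_with_fpc(n0, N):
    """Adjustment (4.3): n = n0 / (1 + (n0 - 1) / N)."""
    return n0 / (1.0 + (n0 - 1.0) / N)

# Blood group example: d = 0.05, p = 0.5, t = 2, island population 3200.
n0 = n0_proportion(p=0.5, d=0.05)
n = n_with_fpc(n0, N=3200)
print(round(n0), round(n))  # 400 356
```

When N is very large the adjustment has almost no effect, so the same two functions cover both cases.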


4.5 THE FORMULA FOR n WITH CONTINUOUS DATA


If $\bar{y}$ is the average of the observations from a simple random sample, we
wish to have

$$\Pr(|\bar{y} - \bar{Y}| \ge d) = \alpha$$

where d is the chosen margin of error and $\alpha$ a small probability. We
assume that $\bar{y}$ is normally distributed: from theorem 2.2, corollary 1, its
standard error is

$$\sigma_{\bar{y}} = \sqrt{\frac{N - n}{N}}\,\frac{S}{\sqrt{n}}$$

Hence

$$d = t\,\sqrt{\frac{N - n}{N}}\,\frac{S}{\sqrt{n}} \qquad (4.4)$$

This gives

$$n = \frac{(tS/d)^2}{1 + \dfrac{1}{N}\left(\dfrac{tS}{d}\right)^2}$$

As in the preceding section, we take as a first approximation

$$n_0 = \left(\frac{tS}{d}\right)^2 = \frac{S^2}{V} \qquad (4.5)$$

This is adequate unless $n_0/N$ is appreciable, in which event we compute n as

$$n = \frac{n_0}{1 + (n_0/N)} \qquad (4.6)$$

If the population total Y is to be estimated with margin of error d, take as a
first approximation

$$n_0 = \left(\frac{tNS}{d}\right)^2 = \frac{(NS)^2}{V}$$

instead of (4.5). Equation 4.6 remains unchanged.


Example. In nurseries that produce young trees for sale it is advisable to
estimate, in late winter or early spring, how many healthy young trees are likely
to be on hand, since this determines policy toward the solicitation and acceptance
of orders. A study of sampling methods for the estimation of the total numbers
of seedlings was undertaken by Johnson (1943). The data that follow were
obtained from a bed of silver maple seedlings, 1 ft wide and 430 ft long. The
sampling unit was 1 ft of the length of the bed, so that N = 430. By complete
enumeration of the bed it was found that $\bar{Y} = 19$, $S^2 = 85.6$, these being the
true population values.

With simple random sampling, how many units must be taken to estimate $\bar{Y}$
within 10%, apart from a chance of 1 in 20? From (4.5) we obtain

$$n_0 = \frac{t^2 S^2}{d^2} = \frac{(4)(85.6)}{(1.9)^2} = 95$$

Since $n_0/N$ is not negligible, we take

$$n = \frac{95}{1 + \frac{95}{430}} = 78$$

Almost 20% of the bed has to be counted in order to attain the precision
desired.

The formulas for n given here apply only to simple random sampling in
which the sample mean is used as the estimate of $\bar{Y}$. The appropriate
formulas for other methods of sampling and estimation are presented with the
discussion of these techniques.
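
The nursery calculation can be reproduced with a brief sketch of (4.5) and (4.6). The function names are illustrative assumptions; the inputs are the values quoted in the example, with the margin of error taken as 10% of the mean of 19.

```python
def n0_mean(S2, d, t=2.0):
    """First approximation (4.5): n0 = (t S / d)^2."""
    return t * t * S2 / (d * d)

def n_adjusted(n0, N):
    """Adjustment (4.6): n = n0 / (1 + n0 / N)."""
    return n0 / (1.0 + n0 / N)

# Silver maple bed: N = 430 units, mean 19, S^2 = 85.6, d = 10% of the mean.
n0 = n0_mean(S2=85.6, d=0.10 * 19)
n = n_adjusted(n0, N=430)
print(round(n0), round(n))  # 95 78
```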


4.6 ADVANCE ESTIMATES OF POPULATION VARIANCES 


The nursery example is atypical in that the population variance $S^2$ was
known. In practice, there are four ways of estimating population variances
for sample size determinations: (1) by taking the sample in two
steps, the first being a simple random sample of size $n_1$ from which the value
of $S^2$ or P and the required n will be obtained; (2) by the results of a pilot
survey; (3) by previous sampling of the same or a similar population; and
(4) by guesswork about the structure of the population, assisted by some
mathematical results.

Method 1 gives the most reliable estimates of $S^2$ or P, but it is not often
used since it slows up the completion of the survey. When the method is
feasible, Cox (1952), following work by Stein (1945), shows how to compute
n from $s_1^2$ or $p_1$ so that the final estimate $\bar{y}$ or p will have a preassigned
variance V, a preassigned limit of error d, or a preassigned coefficient of
variation. The first sample is assumed large enough to neglect terms of
order $1/n_1^2$. A few results are quoted.


Estimation of $\bar{Y}$ with Variance V

If $s_1^2$ is the variance from the first sample, take additional units to make
the total sample size

$$n = \frac{s_1^2}{V}\left(1 + \frac{2}{n_1}\right) \qquad (4.7)$$

The distribution of $\bar{y}$ is assumed to be approximately normal. If S were
known exactly, the required sample size would be $S^2/V$. The effect of not
knowing S is to increase the average size by the factor $(1 + 2/n_1)$.


Estimation of P with Variance V

Let $p_1$ be the estimate of P from the first sample. The combined size of
the first two samples should be

$$n = \frac{p_1 q_1}{V} + \frac{3 - 8p_1 q_1}{p_1 q_1} + \frac{1 - 3p_1 q_1}{V n_1} \qquad (4.8)$$

The first term on the right is the size required if P is known to be equal to
$p_1$. With this method, the ordinary binomial estimate p made from the
complete sample of size n is slightly biased. To correct for bias, take

$$\hat{p} = p + \frac{V(1 - 2p)}{pq}$$

Estimation of P with given cv = $\sqrt{C}$

Take

$$n = \frac{q_1}{C p_1} + \frac{3}{p_1 q_1} + \frac{1}{C p_1 n_1} \qquad (4.9)$$

The estimate is $\hat{p} = p - Cp/q$. In all three results given above the fpc is
ignored.


Example. A sampler wishes to estimate P with a coefficient of variation of
0.1 (10%). He guesses that P will lie somewhere between 5 and 20%. This range
is too wide to give a good initial estimate of the required n. Since the cv of p is
$\sqrt{Q/nP}$, it is easily verified that n = 400 is adequate for P = 20%, but n = 1900
will be needed if P is only 5%.

Accordingly, he takes an initial sample with $n_1 = 400$ and finds $p_1 = 0.105$.
Since $\sqrt{C} = 0.1$, C = 0.01. Equation 4.9 gives

$$n = \frac{0.895}{(0.01)(0.105)} + \frac{3}{(0.0940)} + \frac{1}{(0.01)(42)}$$

The combined sample gives np = 88; p = 88/925 = 0.0951. The correction
for bias, Cp/q, amounts to 0.0011, giving a final estimate of 0.0940 or 9.4%.
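
The preliminary check in this example, and the inflation factor in (4.7), can be sketched in a few lines. The function names are illustrative assumptions, and the inputs to the second function are hypothetical values chosen only to show the effect of the factor (1 + 2/n1); they do not come from any survey discussed here.

```python
def n_for_cv(P, C):
    """Sample size making the squared cv of p equal to C,
    i.e. Q / (n P) = C, with the fpc ignored."""
    return (1.0 - P) / (C * P)

def total_n_for_mean(s1_sq, V, n1):
    """Equation (4.7): total size (s1^2 / V) * (1 + 2 / n1)."""
    return (s1_sq / V) * (1.0 + 2.0 / n1)

# cv of 10% means C = 0.01; compare the two ends of the guessed range.
print(round(n_for_cv(0.20, 0.01)), round(n_for_cv(0.05, 0.01)))  # 400 1900

# Hypothetical first sample: s1^2 = 100 from n1 = 50 units, target V = 1.
print(total_n_for_mean(s1_sq=100.0, V=1.0, n1=50))
```

With a first sample of 50 units the penalty for not knowing the variance is a 4% increase over the size that would be needed if it were known.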


The second method, a small pilot survey, serves many purposes, 
especially if the feasibility of the main survey is in doubt. If the pilot sur- 
vey is itself a simple random sample, the preceding methods apply. But 
often the pilot work is restricted to a part of the population that is con- 
venient to handle or that will reveal the magnitude of certain problems. 
Allowance must be made for the selective nature of the pilot when using 
its results to estimate $S^2$ or P. For instance, a common practice is to confine
the pilot work to a few clusters of units. Thus the computed $s^2$ measures
mostly the variation within a cluster and may be an underestimate of the
relevant $S^2$. The relation between intra- and intercluster variation is
discussed in Chapter 9. The same problem arises in cluster sampling for
proportions, in which the formula pq/n may underestimate the effect of
variation among clusters. Cornfield (1951) gives a good illustration of 
the estimation of sample size in cluster sampling for proportions. 

Method 3—the use of results from previous surveys—points to the value 
of making available, or at least keeping accessible, any data on standard 
deviations obtained in previous surveys. Unfortunately, the cost of
computing standard deviations in complex surveys is high, even with electronic
machines, and frequently only those s.d.'s needed to give a rough idea of the
precision of the principal estimates are computed and recorded. If suitable
past data are found, the value of $S^2$ may require adjustment for time
changes. With skew data in which $\bar{Y}$ is changing with time, $S^2$ is often found
to change at a rate lying somewhere between $k\bar{Y}$ and $k\bar{Y}^2$, where k is a
constant. Thus, if $\bar{Y}$ is thought to have increased by 10% in the time
interval since the previous survey, we might increase our initial estimate of
$S^2$ by 10 to 20%.

Finally, a serviceable estimate of $S^2$ can sometimes be made from
relatively little information about the nature of the population. In early
studies of the numbers of wireworms in soils, a tool was used to take a
sample (9 x 9 x 5 in.) of the topsoil. For estimating n, the sampler
needed to know the standard deviation of the number of wireworms found
in a boring with the tool. If wireworms were distributed at random over
the topsoil, the number found in a small volume would follow the Poisson
distribution, for which $S^2 = \bar{Y}$. Since there might be some tendency for
wireworms to congregate, it was decided to assume $S^2 = 1.2\bar{Y}$, the
factor 1.2 being an arbitrary safety factor. Although $\bar{Y}$ was not known,
the values of $\bar{Y}$ that are of economic importance with respect to crop
damage could be delineated. These two pieces of information made it
possible to determine sample sizes that proved satisfactory.

Deming (1960) shows how some simple mathematical distributions may
be used to estimate $S^2$ from a knowledge of the range and a general idea of
the shape of the distribution. If the distribution is like a binomial, with a
proportion p of the observations at one end of the range and a proportion q
at the other end, $S^2 = pqh^2$, where h is the range. When $p = q = \frac{1}{2}$, the
value of $S^2 = 0.25h^2$ is the maximum possible for a given range h. Other
useful relations are that $S^2 = 0.083h^2$ for a rectangular distribution, $S^2 =
0.056h^2$ for a distribution shaped like a right triangle, and $S^2 = 0.042h^2$ for
an isosceles triangle.

These relations do not help much if h is large or poorly known. However,
if h is large, good sampling practice is to stratify the population (Chapter
5) so that within any stratum the range is much reduced. Usually the
shape also becomes simpler (closer to a rectangular) within a stratum.
Consequently, these relations are effective in predicting $S^2$, hence n, within
individual strata.
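
The range-based relations quoted from Deming can be collected in a small lookup, sketched below. The names are illustrative assumptions; the exact fractions 1/12, 1/18, and 1/24 are the standard variances of the rectangular, right-triangular, and isosceles-triangular distributions on a unit range, and they round to the coefficients 0.083, 0.056, and 0.042 given in the text.

```python
# Variance of each shape as a fraction of the squared range h^2.
SHAPE_FACTORS = {
    "rectangular": 1.0 / 12.0,
    "right_triangle": 1.0 / 18.0,
    "isosceles_triangle": 1.0 / 24.0,
}

def variance_from_range(h, shape):
    """Advance estimate of S^2 from the range h and an assumed shape."""
    return SHAPE_FACTORS[shape] * h * h

def two_point_variance(h, p):
    """S^2 = p q h^2 when a proportion p sits at one end of the range
    and q = 1 - p at the other; largest when p = 1/2."""
    return p * (1.0 - p) * h * h

for shape, factor in SHAPE_FACTORS.items():
    print(shape, round(factor, 3))
```

An advance value of S² obtained this way would then be entered in (4.5) in the usual manner.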


4.7 SAMPLE SIZE WITH MORE THAN ONE ITEM 


In most surveys information is collected on more than one item. One
method of determining sample size is to specify margins of error for the
items that are regarded as most vital to the survey. An estimation of the
sample size needed is first made separately for each of these important
items.

When the single-item estimations of n have been completed, it is time to 
take stock of the situation. It may happen that the n’s required are all 
reasonably close. If the largest of the n’s falls within the limits of the
budget, this n is selected. More commonly, there is a sufficient variation 
among the n’s so that we are reluctant to choose the largest, either from 
budgetary considerations or because this will give an over-all standard of 
precision substantially higher than originally contemplated. In this event 
the desired standard of precision may be relaxed for certain of the items, in 
order to permit the use of a smaller value of n. 

In some cases the n’s required for different items are so discordant that 
certain of them must be dropped from the inquiry, for with the resources 
available the precision expected for these items is totally inadequate. The 
difficulty may not be merely one of sample size. Some items call for a 
different type of sampling from others. With populations that are sampled 
repeatedly, it is useful to amass information about those items that can be 
combined economically in a general survey and those that necessitate 
special methods. As an example, a classification of items into four types,
suggested by experience in regional agricultural surveys, is shown in Table
4.1. In this classification, a general survey means one in which the units are
fairly evenly distributed over some region, as for example by a simple
random sample.


TABLE 4.1

AN EXAMPLE OF DIFFERENT TYPES OF ITEM IN
REGIONAL SURVEYS

Type  Characteristics of Item                  Type of Sampling Needed

1     Widespread throughout the region,        A general survey with low
      occurring with reasonable frequency      sampling ratio.
      in all parts.

2     Widespread throughout the region         A general survey, but with
      but with low frequency.                  a higher sampling ratio.

3     Occurring with reasonable frequency      For best results, a stratified
      in most parts of the region, but with    sample with different
      more sporadic distribution, being        intensities in different parts
      absent in some parts and highly          of the region (Chapter 5).
      concentrated in others.                  Can sometimes be included
                                               in a general survey with
                                               supplementary sampling.

4     Distribution very sporadic or            Not suitable for a general
      concentrated in a small part of          survey. Requires a sample
      the region.                              geared to its distribution.


4.8 SAMPLE SIZE WHEN ESTIMATES ARE 
WANTED FOR SUBDIVISIONS OF THE POPULATION 


It is often planned to present estimates not only for the population as a
whole but for certain subdivisions. If these can be identified in advance, as
with different geographical regions, a separate calculation of n is made for
each region. Suppose that the mean of each subdivision is to be estimated
with a specified variance V. For the ith subdivision, we have $n_i = S_i^2/V$,
so that the total sample size $n = \sum S_i^2/V$. The individual $S_i^2$ will, on the
average, be smaller than $S^2$, the population variance, but often they are
only slightly smaller. Thus, if there are k subdivisions, $n \approx kS^2/V$, whereas
if only the estimate for the population as a whole were wanted we would
take $n = S^2/V$.

Thus if estimates with variance V are wanted for each of k subdivisions 
the sample size must be roughly k times as large as is needed for an over-all 
estimate of the same precision. This point tends to be overlooked in calcu- 
lations of sample size by persons inexperienced in survey methods. 

If the subdivisions represent classifications by variables such as age, sex,
income, and years of schooling, the subdivision to which a person belongs
is not known until the sample has been taken. Advance sample size
estimates can still be made if the proportions $\pi_i$ of the units that belong to the
various subdivisions are known. If a simple random sample of size n is
selected, the expected size of sample from the ith subdivision is $n\pi_i$. The
average variance of the mean from this subdivision is

$$V(\bar{y}_i) = E\left(\frac{S_i^2}{n_i}\right) \approx \frac{S_i^2}{n\pi_i}$$

if $n\pi_i$ is large. Hence we require $n = S_i^2/\pi_i V$ in order to make $V(\bar{y}_i) = V$.
If this is to hold for every subdivision,

$$n = \max_i\left(\frac{S_i^2}{\pi_i V}\right) \approx \frac{S^2}{V}\,\max_i\left(\frac{1}{\pi_i}\right)$$

If the subdivisions are approximately equal in size, $\pi_i \approx 1/k$, but the factor
$\max(1/\pi_i)$ can be considerably larger than k if some subdivisions are rare.
In this event, we may either have to increase the value of V in this
subdivision or find some way of identifying units in rare subdivisions in
advance so that they can be sampled at a higher rate. The method of double
sampling (Chapter 12) is sometimes useful for this purpose.

The demands on sample size are still greater in analytical studies in which
the specifications are

$$V(\bar{y}_i - \bar{y}_j) \le V$$

for every pair of subdivisions (domains). In this case

$$n = \max_{i,j}\,\frac{1}{V}\left(\frac{S_i^2}{\pi_i} + \frac{S_j^2}{\pi_j}\right)$$

If the $S_i^2$ are not very different from $S^2$, n will be $2kS^2/V$ when the k
domains are of equal size, and still greater otherwise. The effect of fpc
terms, neglected in this discussion, is to reduce the required n’s to some
extent.
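
The two rules of this section can be compared numerically with a short sketch. The function names and the numerical inputs (four equal domains, each with variance 100, and V = 1) are hypothetical values chosen for illustration; the fpc is ignored as in the text.

```python
from itertools import combinations

def n_for_domain_means(S2, pi, V):
    """Total n so that every domain mean has variance at most V:
    n = max_i S_i^2 / (pi_i V)."""
    return max(s2 / (p * V) for s2, p in zip(S2, pi))

def n_for_domain_differences(S2, pi, V):
    """Total n so that every difference between two domain means has
    variance at most V: n = max_{i,j} (S_i^2/pi_i + S_j^2/pi_j) / V."""
    per_domain = [s2 / p for s2, p in zip(S2, pi)]
    return max((a + b) / V for a, b in combinations(per_domain, 2))

k = 4
S2 = [100.0] * k          # hypothetical within-domain variances
pi = [1.0 / k] * k        # domains of equal size
V = 1.0
print(n_for_domain_means(S2, pi, V))        # k S^2 / V
print(n_for_domain_differences(S2, pi, V))  # 2 k S^2 / V
```

Replacing the equal proportions by unequal ones, for instance one domain containing only 10% of the units, shows how quickly a rare domain comes to dominate the required n.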


4.9 SAMPLE SIZE IN DECISION PROBLEMS


A more logical approach to the determination of sample size can some- 
times be developed when a practical decision is to be made from the results 
of the sample. The decision will presumably be more soundly based if the 
sample estimate has a low error than if it has a high error. We may be able 
to calculate, in monetary terms, the loss $l(z)$ that will be incurred in a
decision through an error of amount z in the estimate. Although the actual
value of z is not predictable in advance, sampling theory enables us to find
the frequency distribution $f(z, n)$ of z, which for a specified sampling
method will depend on the sample size n. Hence the expected loss for a
given size of sample is


$$L(n) = \int l(z) f(z, n)\, dz$$


The purpose in taking the sample is to diminish this loss. If C(n) is the 
cost of a sample of size n, a reasonable procedure is to choose n to minimize 


$$C(n) + L(n)$$


since this is the total cost involved in taking the sample and in making 
decisions from its results. The choice of n determines both the optimum 
size of sample and the most advantageous degree of precision. 

Alternatively, the same approach can be presented in terms of the 
monetary gain that accrues from having the sample information, rather 
than in terms of the loss that arises from errors in the sample information. 
If monetary gain is used, we construct an expected gain G(n) from a sample 
of size n, where G(n) is zero if no sample is taken. We maximize 


$$G(n) - C(n)$$


In this form the principle is equivalent to the rule in classical economics
that profit is to be maximized.
The simplest application occurs when the loss function, $l(z)$, is $\lambda z^2$, where
$\lambda$ is a constant. It follows that

$$L(n) = \lambda E(z^2)$$

For instance, if $\bar{y}$ is the sample estimate of $\bar{Y}$, and $z = \bar{y} - \bar{Y}$,

$$L(n) = \lambda V(\bar{y}) = \frac{\lambda S^2}{n} - \frac{\lambda S^2}{N}$$

if simple random sampling is used.
The simplest type of cost function for the sample is

$$C(n) = c_0 + c_1 n$$

where $c_0$ is the overhead cost. By differentiation, the value of n which
minimizes cost plus loss is

$$n = \sqrt{\lambda S^2 / c_1}$$

A more general form of this result is given by Yates (1960). The same
analysis applies to any method of sampling and estimation in which the
variance of the estimate is inversely proportional to n and the cost is a
linear function of n.
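
A brief numerical sketch may help to confirm the result obtained by differentiation. The function names and all the numerical values below (the loss constant, the variance, and the two cost coefficients) are hypothetical; only the quadratic loss and the linear cost function come from the text.

```python
import math

def optimum_n(lam, S2, c1):
    """Sample size minimizing cost plus expected loss when the loss is
    lam * z^2 and the cost is c0 + c1 * n."""
    return math.sqrt(lam * S2 / c1)

def cost_plus_loss(n, lam, S2, c0, c1, N):
    """C(n) + L(n) with L(n) = lam * S2 / n - lam * S2 / N."""
    return c0 + c1 * n + lam * S2 / n - lam * S2 / N

# Hypothetical inputs: lam = 400, S^2 = 100, overhead 50, 4 per unit.
lam, S2, c0, c1, N = 400.0, 100.0, 50.0, 4.0, 10000
n_opt = optimum_n(lam, S2, c1)
print(n_opt)  # 100.0

# The total is no smaller at neighboring sample sizes.
for n in (n_opt - 1, n_opt, n_opt + 1):
    print(n, round(cost_plus_loss(n, lam, S2, c0, c1, N), 2))
```

Neither the overhead cost nor the term in 1/N depends on n, which is why both drop out of the optimum.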

Blythe (1945) describes the application of this principle to the estimation 
of the volume of timber in a lot for selling purposes (see exercise 4.11). 
Nordin (1944) discusses the optimum size of sample for estimating poten- 
tial sales in a market which a manufacturer intends to enter. If the sales 
can be forecast accurately, the amount of fixed equipment and the
production per unit period can be allocated to maximize the manufacturer's
expected profit. Grundy et al. (1954, 1956) consider the optimum size of a 
second sample when the results of a first sample are already known. 

This approach has received substantial further development from workers 
on statistical decision theory. Generalizations include the substitution of 
utility for money value as a scale on which to measure costs and losses, the 
explicit use of subjective prior information about unknown parameters by 
expressing this information as “prior” probability distributions of the 
unknown parameters, and the investigation of different types of cost and 
loss functions and of qualitative as well as quantitative data. For a
comprehensive account of the method, see Raiffa and Schlaifer (1961).
Although it is still not evident how frequently decision problems will be
amenable to complete solution by this approach, the method has value in
stimulating clear thinking about the important factors in a good decision.
One area that appears suitable for applications is the sampling of lots of
articles in a mass-production process in order to decide whether to accept
or reject the lot on the basis of its estimated quality. Sittig (1951)
considers the economics of sample-size determination, taking account of costs
of inspection and the costs incurred through defective articles in accepted 
lots and good articles in rejected lots. 


EXERCISES 


4.1 In a district containing 4000 houses the percentage of owned houses is to
be estimated with a s.e. of not more than 2%, and the percentage of two-car
households with a s.e. of not more than 1%. (The figures 2 and 1% are the
absolute values, not the cv's.) The true percentage of owners is thought to lie
between 45 and 65% and the percentage of two-car households between 5 and 
10%. How large a sample is necessary to satisfy both aims? 

4.2 In the population of 676 petition sheets (Table 2.1, page 27) how large 
must the sample be if the total number of signatures is to be estimated with a 
margin of error of 1000, apart from a 1 in 20 chance? Assume that the value of 
$s^2$ given on page 27 is the population $S^2$.

4.3 A survey is to be made of the prevalence of the common diseases in a 
large population. For any disease that affects at least 1% of the individuals in 
the population, it is desired to estimate the total number of cases, with a coefficient 
of variation of not more than 20%. (a) What size of simple random sample is 
needed, assuming that the presence of the disease can be recognized without 
mistakes? (b) What size is needed if total cases are wanted separately for males 
and females, with the same precision? 

4.4 In a wireworm survey the number of wireworms per acre is to be estimated
with a limit of error of 30%, at the 95% probability level, in any field in
which wireworm density exceeds 200,000 per acre in the top 5 in. of soil. The
sampling tool measures 9 x 9 x 5 in. deep. Assuming that the number of
wireworms in a single sample follows a distribution slightly more variable than
the Poisson, we take $S^2 = 1.2\bar{Y}$. What size of simple random sample is needed?
(1 acre = 43,560 sq ft.)

4.5 The following coefficients of variation per unit were obtained in a farm
survey in Iowa, the unit being an area 1 mile square (data of R. J. Jessen):


    Item                          Estimated cv (%)

    Acres in farms                       38
    Acres in corn                        39
    Acres in oats                        44
    Number of family workers            100
    Number of hired workers             110
    Number of unemployed                317


A survey is planned to estimate acreage items with a cv of 2½% and numbers of
workers (excluding unemployed) with a cv of 5%. With simple random sampling,
how many units are needed? How well would this sample be expected to estimate
the number of unemployed?

4.6 By experimental sampling, the mean value of a random variate is to be
estimated with variance V = 0.0005. The values of the random variate for the
first 20 samples drawn are shown below. How many more samples are needed?
(Use equation 4.7.)


    Sample    Value of            Sample    Value of
    Number    Random Variate      Number    Random Variate

       1         0.0725             11         0.0712
       2         0.0755             12         0.0748
       3         0.0759             13         0.0878
       4         0.0739             14         0.0710
       5         0.0732             15         0.0754
       6         0.0843             16         0.0712
       7         0.0727             17         0.0757
       8         0.0769             18         0.0737
       9         0.0730             19         0.0704
      10         0.0727             20         0.0723


4.7 A household survey is designed to estimate the proportion of families
possessing certain attributes. For the principal items of interest, the value of P
is expected to lie between 30 and 70%. With simple random sampling, how
large are the values of n necessary to estimate the following means with a standard
error not exceeding 3%? (a) the over-all mean P. (b) the individual means P_i
for the income classes: under $5000; $5000 to $10,000; over $10,000
(i = 1, 2, 3). (c) the differences between the means (P_i - P_j) for every pair of
the classes in (b). Give a separate answer for (a), (b), and (c). Income statistics
indicate that the proportions of families with incomes in the three classes above
are 50, 38, and 12%.

4.8 The four-year colleges in the United States were divided into classes of
four different sizes according to their 1952-1953 enrollments. The standard
deviations within each class are shown below.


    Class                     1          2              3               4

    Number of students     < 1000    1000-3000     3000-10,000     over 10,000
    S_h                      236        625           2008            10,023


If you know the class boundaries but not the values of S_h, how well can you
guess the S_h values by using simple mathematical figures (section 4.6)? No
college has less than 200 students and the largest has about 50,000 students.
4.9 With a quadratic loss function and a linear cost function, as in section
4.9, $S^2$ is reduced to $S'^2$ by a superior sampling plan, $c_0$, $c_1$, and $\lambda$ remaining
unchanged. If $n'$, $V'$ denote the new optimum sample size and the accompanying
$V(\bar{y})$, show that $n' < n$ and that $V' < V$, provided that $n \le N/2$.

4.10 If the loss function due to an error in $\bar{y}$ is $\lambda|\bar{y} - \bar{Y}|$ and if the cost
$C = c_0 + c_1 n$, show that with simple random sampling, ignoring the fpc, the
most economical value of $n$ is

$$\left(\frac{\lambda S}{c_1\sqrt{2\pi}}\right)^{2/3}$$



4.11 (Adapted from Blythe, 1945) The selling price of a lot of standing 
timber is UW, where U is the price per unit volume and W is the volume of 
timber on the lot. The number N of logs on the lot is counted, and the average 
volume per log is estimated from a simple random sample of n logs. The estimate 
is made and paid for by the seller and is provisionally accepted by the buyer. 
Later, the buyer finds out the exact volume purchased, and the seller reimburses 
him if he has paid for more than was delivered. If he has paid for less than was 
delivered, the buyer does not mention the fact. 

Construct the seller's loss function. Assuming that the cost of measuring $n$
logs is $cn$, find the optimum value of $n$. The standard deviation of the volume
per log may be denoted by $S$ and the fpc ignored.


Stratified Random Sampling


5.1 DESCRIPTION 


In stratified sampling the population of $N$ units is first divided into
subpopulations of $N_1, N_2, \ldots, N_L$ units, respectively. These subpopulations
are nonoverlapping, and together they comprise the whole of the population,
so that

$$N_1 + N_2 + \cdots + N_L = N$$

The subpopulations are called strata. To obtain the full benefit from
stratification, the values of the $N_h$ must be known. When the strata have been
determined, a sample is drawn from each, the drawings being made
independently in different strata. The sample sizes within the strata are
denoted by $n_1, n_2, \ldots, n_L$, respectively.
If a simple random sample is taken in each stratum, the whole procedure
is described as stratified random sampling.
Stratification is a common technique. There are many reasons for this;
the principal ones are the following:

1. If data of known precision are wanted for certain subdivisions of the
population, it is advisable to treat each subdivision as a "population" in its
own right.

2. Administrative convenience may dictate the use of stratification;
for example, the agency conducting the survey may have field offices, each
of which can supervise the survey for a part of the population.

3. Sampling problems may differ markedly in different parts of the
population. With human populations, people living in institutions (e.g.,
hotels, hospitals, prisons) are often placed in a different stratum from
people living in ordinary homes because a different approach to the sampling
is appropriate for the two situations. In sampling businesses we may
possess a list of the large firms, which are placed in a separate stratum.
Some type of area sampling may have to be used for the smaller firms.

4. Stratification may produce a gain in precision in the estimates of
characteristics of the whole population. It may be possible to divide a
heterogeneous population into subpopulations, each of which is internally
homogeneous. This is suggested by the name strata, with its implication of
a division into layers. If each stratum is homogeneous, in that the measurements
vary little from one unit to another, a precise estimate of any
stratum mean can be obtained from a small sample in that stratum. These
estimates can then be combined into a precise estimate for the whole
population.


The theory of stratified sampling deals with the properties of the esti- 
mates from a stratified sample and with the best choice of the sample sizes 
$n_h$ to obtain maximum precision. In this development it is taken for
granted that the strata have already been constructed. The problems of 
how to construct strata and of how many strata there should be are 
postponed to a later stage (section 5A.6). 


5.2. NOTATION 


The suffix h denotes the stratum and i the unit within the stratum. The 
notation is a natural extension of that previously used. The following 
symbols all refer to stratum h: 


    $N_h$                                                            total number of units
    $n_h$                                                            number of units in sample
    $y_{hi}$                                                         value obtained for the $i$th unit
    $W_h = N_h / N$                                                  stratum weight
    $f_h = n_h / N_h$                                                sampling fraction in the stratum
    $\bar{Y}_h = \sum_{i=1}^{N_h} y_{hi} / N_h$                      true mean
    $\bar{y}_h = \sum_{i=1}^{n_h} y_{hi} / n_h$                      sample mean
    $S_h^2 = \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)^2 / (N_h - 1)$    true variance


Note that the divisor for the variance is $(N_h - 1)$.


5.3 PROPERTIES OF THE ESTIMATES


For the population mean per unit, the estimate used in stratified sampling
is $\bar{y}_{st}$ (st for stratified), where

$$\bar{y}_{st} = \frac{\sum_{h=1}^{L} N_h \bar{y}_h}{N} \tag{5.1}$$

where $N = N_1 + N_2 + \cdots + N_L$.
The estimate $\bar{y}_{st}$ is not in general the same as the sample mean. The
sample mean, $\bar{y}$, can be written as

$$\bar{y} = \frac{\sum_{h=1}^{L} n_h \bar{y}_h}{n} \tag{5.2}$$

The difference is that in $\bar{y}_{st}$ the estimates from the individual strata receive
their correct weights $N_h/N$. It is evident that $\bar{y}$ coincides with $\bar{y}_{st}$ provided
that in every stratum

$$\frac{n_h}{n} = \frac{N_h}{N} \quad \text{or} \quad \frac{n_h}{N_h} = \frac{n}{N}$$

This means that the sampling fraction is the same in all strata. This 
stratification is described as stratification with proportional allocation of 
the n,. It gives a self-weighting sample. If numerous estimates have to be 
made, a self-weighting sample is time-saving.
The principal properties of the estimate $\bar{y}_{st}$ are outlined in the following
theorems. The first two theorems apply to stratified sampling in general 
and are not restricted to stratified random sampling; that is, the sample 
from any stratum need not be a simple random sample. 
Theorem 5.1. If in every stratum the sample estimate $\bar{y}_h$ is unbiased,
then $\bar{y}_{st}$ is an unbiased estimate of the population mean $\bar{Y}$.
Proof.

$$E(\bar{y}_{st}) = E\left(\frac{\sum_{h=1}^{L} N_h \bar{y}_h}{N}\right) = \frac{\sum_{h=1}^{L} N_h \bar{Y}_h}{N}$$

since the estimates are unbiased in the individual strata. But the population
mean $\bar{Y}$ may be written

$$\bar{Y} = \frac{\sum_{h=1}^{L}\sum_{i=1}^{N_h} y_{hi}}{N} = \frac{\sum_{h=1}^{L} N_h \bar{Y}_h}{N}$$

This completes the proof.
Corollary. Since $\bar{y}_h$ is an unbiased estimate of $\bar{Y}_h$ for simple random
sampling within strata, $\bar{y}_{st}$ is an unbiased estimate of $\bar{Y}$ for stratified
random sampling.
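
The distinction between $\bar{y}_{st}$ in (5.1) and the sample mean in (5.2) can be seen in a small sketch. The stratum sizes match the two-stratum city example used later in this chapter, but the stratum sample means and the function names are hypothetical, chosen only for illustration.

```python
def stratified_mean(N_h, ybar_h):
    """Equation (5.1): stratum sample means weighted by N_h / N."""
    N = sum(N_h)
    return sum(Nh * yb for Nh, yb in zip(N_h, ybar_h)) / N

def sample_mean(n_h, ybar_h):
    """Equation (5.2): stratum sample means weighted by n_h / n."""
    n = sum(n_h)
    return sum(nh * yb for nh, yb in zip(n_h, ybar_h)) / n

N_h = [16, 48]
ybar_h = [600.0, 200.0]      # hypothetical stratum sample means

proportional = [6, 18]       # n_h / N_h = 3/8 in both strata
equal = [12, 12]             # sampling fractions 3/4 and 1/4

print(stratified_mean(N_h, ybar_h))       # 300.0
print(sample_mean(proportional, ybar_h))  # 300.0, self-weighting
print(sample_mean(equal, ybar_h))         # 400.0, over-weights stratum 1
```

With equal sample sizes the unweighted sample mean gives too much weight to the heavily sampled stratum, which is why the weights $N_h/N$ are needed.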


Theorem 5.2. For stratified sampling, the variance of $\bar{y}_{st}$, as an
estimate of the population mean $\bar{Y}$, is

$$V(\bar{y}_{st}) = \frac{\sum_{h=1}^{L} N_h^2 V(\bar{y}_h)}{N^2} = \sum_{h=1}^{L} W_h^2 V(\bar{y}_h) \tag{5.3}$$

where

$$V(\bar{y}_h) = E(\bar{y}_h - \bar{Y}_h)^2$$

There are two restrictions on the theorem: (a) $\bar{y}_h$ must be an unbiased
estimate of $\bar{Y}_h$, and (b) the samples must be drawn independently in
different strata.


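Proof.

$$\bar{y}_{st} - \bar{Y} = \frac{\sum N_h \bar{y}_h}{N} - \frac{\sum N_h \bar{Y}_h}{N} = \frac{\sum N_h(\bar{y}_h - \bar{Y}_h)}{N} \tag{5.4}$$

where the sum extends over all strata. Note that the error $(\bar{y}_{st} - \bar{Y})$
in the estimate is now expressed as a weighted mean of the errors of
estimation which have been made within the individual strata. Hence

$$(\bar{y}_{st} - \bar{Y})^2 = \frac{\sum_h N_h^2(\bar{y}_h - \bar{Y}_h)^2}{N^2} + \frac{2\sum_h\sum_{j>h} N_h N_j(\bar{y}_h - \bar{Y}_h)(\bar{y}_j - \bar{Y}_j)}{N^2}$$

where the right-hand term extends over all pairs of strata.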

We now average over all possible samples. For any cross-product term, 
we begin by keeping the sample in stratum h fixed, and average over all 
samples in stratum j. Since sampling is independent in the two strata, the 
possible samples in stratum j will be the same and have the same
probabilities, whatever sample has been drawn in stratum h. But since $\bar{y}_j$ is
assumed unbiased, the average of $(\bar{y}_j - \bar{Y}_j)$ is zero. Hence all
cross-product terms vanish.

The squared terms give

$$V(\bar{y}_{st}) = \frac{\sum N_h^2 E(\bar{y}_h - \bar{Y}_h)^2}{N^2} = \frac{\sum N_h^2 V(\bar{y}_h)}{N^2}$$

The important point about this result is that the variance of $\bar{y}_{st}$ depends
only on the variances of the estimates of the individual stratum means $\bar{Y}_h$.
If it were possible to divide a highly variable population into strata such
that all items had the same value within a stratum, we could estimate $\bar{Y}$
without any error. Equation (5.4) shows that it is the use of the correct
stratum weights $N_h/N$ in making the estimate $\bar{y}_{st}$ that leads to this result.


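Theorem 5.3. For stratified random sampling, the variance of the
estimate $\bar{y}_{st}$ is

$$V(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L} N_h(N_h - n_h)\frac{S_h^2}{n_h} = \sum_{h=1}^{L} \frac{W_h^2 S_h^2}{n_h}(1 - f_h) \tag{5.5}$$

Proof. Since $\bar{y}_h$ is an unbiased estimate of $\bar{Y}_h$, theorem 5.2 can be
applied. Further, by theorem 2.2, applied to an individual stratum,

$$V(\bar{y}_h) = \frac{S_h^2}{n_h}\,\frac{N_h - n_h}{N_h}$$

By substitution into the result of theorem 5.2, we obtain

$$V(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L} N_h^2 V(\bar{y}_h) = \frac{1}{N^2}\sum_{h=1}^{L} N_h(N_h - n_h)\frac{S_h^2}{n_h} = \sum_{h=1}^{L}\frac{W_h^2 S_h^2}{n_h}(1 - f_h)$$
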
Some particular cases of this formula are given in the following corol- 
laries. 
Corollary 1. If the sampling fractions $n_h/N_h$ are negligible in all strata,

$$V(\bar{y}_{st}) = \frac{1}{N^2}\sum \frac{N_h^2 S_h^2}{n_h} = \sum \frac{W_h^2 S_h^2}{n_h} \tag{5.6}$$

This is the appropriate formula when finite population corrections can be
ignored.


Corollary 2. With proportional allocation, we substitute

$$n_h = \frac{nN_h}{N}$$

in (5.5). The variance reduces to

$$V(\bar{y}_{st}) = \sum \frac{N_h}{N}\,\frac{S_h^2}{n}\left(\frac{N - n}{N}\right) = \frac{1 - f}{n}\sum W_h S_h^2 \tag{5.7}$$


Corollary 3. If sampling is proportional and the variances in all strata
have the same value, $S_w^2$, we obtain the simple result

$$V(\bar{y}_{st}) = \frac{S_w^2}{n}\left(\frac{N - n}{N}\right) \tag{5.8}$$

Theorem 5.4. If $\hat{Y}_{st} = N\bar{y}_{st}$ is the estimate of the population total
$Y$, then

$$V(\hat{Y}_{st}) = \sum N_h(N_h - n_h)\frac{S_h^2}{n_h} \tag{5.9}$$


This follows at once from theorem 5.3.


TABLE 5.1
SIZES OF 64 CITIES (IN 1000s) IN 1920 AND 1930

        1920 Size ($x_{hi}$)              1930 Size ($y_{hi}$)

              Stratum                           Stratum
         1          2                      1          2

        797    314  172  121              900    364  209  113
        773    298  172  120              822    317  183  115


Note. Cities are arranged in the same order in both years.


Totals and sums of squares

                        1920                                1930
    Stratum    $\sum x_{hi}$   $\sum x_{hi}^2$      $\sum y_{hi}$   $\sum y_{hi}^2$

       1           8,349         4,756,619             10,070         7,145,450
       2           7,941         1,474,871              9,498         2,141,720

Example. Table 5.1 shows the 1920 and 1930 numbers of inhabitants, in 
thousands, of 64 large cities in the United States. The data were obtained by 
taking the cities which ranked fifth to sixty-eighth in the United States in total 
number of inhabitants in 1920. The cities are arranged in two strata, the first 
containing the 16 largest cities and the second the remaining 48 cities. 

The total number of inhabitants in all 64 cities in 1930 is to be estimated from 
a sample of size 24. Find the standard error of the estimated total for (1) a 
simple random sample, (2) a stratified random sample with proportional 
allocation, (3) a stratified random sample with 12 units drawn from each stratum. 
This population resembles the populations of many types of business enterprise
in that some units (the large cities) contribute very substantially to the total
and display much greater variability than the remainder.
The stratum totals and sums of squares are given under Table 5.1. Only the
1930 data are used in this example; the 1920 data appear in a later example.

For the complete population in 1930, we find

$$Y = 19{,}568, \qquad S^2 = 52{,}448$$

The three estimates of $Y$ are denoted by $\hat{Y}_{ran}$, $\hat{Y}_{prop}$, and $\hat{Y}_{equal}$.

1. For simple random sampling,

$$V(\hat{Y}_{ran}) = \frac{N^2 S^2}{n}\,\frac{N - n}{N} = \frac{(64)^2(52{,}448)}{24}\left(\frac{40}{64}\right) = 5{,}594{,}453$$

from theorem 2.2, corollary 2. The standard error is

$$\sigma(\hat{Y}_{ran}) = 2365$$

2. For the individual strata the variances are

$$S_1^2 = 53{,}843, \qquad S_2^2 = 5581$$

Note that the stratum with the largest cities has a variance nearly 10 times that
of the other stratum.
In proportional allocation, we have $n_1 = 6$, $n_2 = 18$. From (5.7), multiplying
by $N^2$, we have

$$V(\hat{Y}_{prop}) = \frac{N - n}{n}\sum N_h S_h^2 = \frac{40}{24}[(16)(53{,}843) + (48)(5581)] = 1{,}882{,}293$$

$$\sigma(\hat{Y}_{prop}) = 1372$$

3. For $n_1 = n_2 = 12$ we use the general formula (5.9):

$$V(\hat{Y}_{equal}) = \sum N_h(N_h - n_h)\frac{S_h^2}{n_h} = \frac{(16)(4)(53{,}843)}{12} + \frac{(48)(36)(5581)}{12} = 1{,}090{,}827$$

$$\sigma(\hat{Y}_{equal}) = 1044$$

In this example equal sample sizes in the two strata are more precise than
proportional allocation. Both are greatly superior to simple random sampling.
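
The arithmetic of this example can be retraced from the stratum totals and sums of squares printed under Table 5.1. The sketch below is only an illustration; the function names are hypothetical, and because it carries unrounded variances the intermediate figures differ slightly from those quoted, although the standard errors agree after rounding.

```python
import math

def s2_from_totals(total, sum_sq, size):
    """Variance with divisor (size - 1), from a total and a sum of squares."""
    return (sum_sq - total**2 / size) / (size - 1)

def var_total_srs(N, n, S2):
    """Variance of the estimated total under simple random sampling."""
    return N**2 * S2 / n * (N - n) / N

def var_total_stratified(N_h, n_h, S2_h):
    """Equation (5.9): variance of the estimated total, stratified random sampling."""
    return sum(Nh * (Nh - nh) * s2 / nh for Nh, nh, s2 in zip(N_h, n_h, S2_h))

# 1930 figures from the totals printed under Table 5.1
N_h = [16, 48]
totals = [10_070, 9_498]
sum_sqs = [7_145_450, 2_141_720]

S2_h = [s2_from_totals(t, q, Nh) for t, q, Nh in zip(totals, sum_sqs, N_h)]
N = sum(N_h)
S2 = s2_from_totals(sum(totals), sum(sum_sqs), N)

se_ran = math.sqrt(var_total_srs(N, 24, S2))
se_prop = math.sqrt(var_total_stratified(N_h, [6, 18], S2_h))
se_equal = math.sqrt(var_total_stratified(N_h, [12, 12], S2_h))
print(round(se_ran), round(se_prop), round(se_equal))  # 2365 1372 1044
```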


5.4 THE ESTIMATED VARIANCE AND 
CONFIDENCE LIMITS 


If a simple random sample is taken within each stratum, an unbiased
estimate of $S_h^2$ (from theorem 2.4) is

$$s_h^2 = \frac{1}{n_h - 1}\sum_{i=1}^{n_h}(y_{hi} - \bar{y}_h)^2 \tag{5.10}$$

Hence we obtain the following:
Theorem 5.5. With stratified random sampling, an unbiased estimate
of the variance of $\bar{y}_{st}$ is

$$v(\bar{y}_{st}) = s^2(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L} N_h(N_h - n_h)\frac{s_h^2}{n_h} \tag{5.11}$$

An alternative form for computing purposes is

$$s^2(\bar{y}_{st}) = \sum_{h=1}^{L}\frac{W_h^2 s_h^2}{n_h} - \sum_{h=1}^{L}\frac{W_h s_h^2}{N} \tag{5.12}$$

The second term on the right represents the reduction due to the fpc.
In order to compute this estimate, there must be at least two units drawn
from every stratum. Estimation of the variance when stratification is carried
to the point at which only one unit is chosen per stratum is discussed
in section 5A.11.
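
As a check on the algebra, the two forms (5.11) and (5.12) can be computed side by side. This is a minimal sketch: the function names and the sample values are hypothetical, and `statistics.variance` is used because it applies the divisor $(n_h - 1)$ required by (5.10).

```python
from statistics import variance

def est_var_mean(N_h, samples):
    """Equation (5.11): unbiased estimate of V(ybar_st)."""
    N = sum(N_h)
    total = 0.0
    for Nh, y in zip(N_h, samples):
        nh = len(y)  # at least two units are needed in every stratum
        total += Nh * (Nh - nh) * variance(y) / nh
    return total / N**2

def est_var_mean_alt(N_h, samples):
    """Equation (5.12): the same estimate, with the fpc as a separate term."""
    N = sum(N_h)
    first = sum((Nh / N) ** 2 * variance(y) / len(y) for Nh, y in zip(N_h, samples))
    fpc_term = sum((Nh / N) * variance(y) / N for Nh, y in zip(N_h, samples))
    return first - fpc_term

# Hypothetical sample values in two strata of sizes 16 and 48
N_h = [16, 48]
samples = [[10.0, 12.0, 14.0], [3.0, 5.0, 4.0, 6.0]]

v1 = est_var_mean(N_h, samples)
v2 = est_var_mean_alt(N_h, samples)
print(v1, v2)  # the two forms agree up to rounding error
```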


Corollary. In certain applications it is reasonable to suppose that
$S_h^2$ has the same value in all strata. From the analysis of variance of the
sample, a pooled estimate of this common variance is

$$s_w^2 = \frac{1}{n - L}\sum_{h=1}^{L}\sum_{i=1}^{n_h}(y_{hi} - \bar{y}_h)^2$$

Since sampling is usually proportional in this situation, the estimated
variance of $\bar{y}_{st}$ takes the simple form (from theorem 5.3, corollary 3)

$$v(\bar{y}_{st}) = \frac{s_w^2}{n}\,\frac{N - n}{N}$$

with $n - L$ degrees of freedom.


The formulas for confidence limits are as follows:

$$\text{Population mean:} \quad \bar{y}_{st} \pm t\,s(\bar{y}_{st}) \tag{5.13}$$

$$\text{Population total:} \quad N\bar{y}_{st} \pm tN\,s(\bar{y}_{st}) \tag{5.14}$$

These formulas assume that $\bar{y}_{st}$ is normally distributed and that $s(\bar{y}_{st})$ is
well determined, so that the multiplier $t$ can be read from tables of the
normal distribution.

If only a few degrees of freedom are provided by each stratum, the
usual procedure for taking account of the sampling error attached to a
quantity like $s(\bar{y}_{st})$ is to read the t-value from the tables of Student's t
instead of from the normal table. The distribution of $s(\bar{y}_{st})$ is in general too
complex to allow a strict application of this method. An approximate
method of assigning an effective number of degrees of freedom to $s(\bar{y}_{st})$ is as
follows (Satterthwaite, 1946):


We may write

$$s^2(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L} g_h s_h^2, \quad \text{where } g_h = \frac{N_h(N_h - n_h)}{n_h}$$

The effective number of degrees of freedom $n_e$ is

$$n_e = \frac{\left(\sum g_h s_h^2\right)^2}{\sum \dfrac{g_h^2 s_h^4}{n_h - 1}} \tag{5.15}$$

The value of $n_e$ always lies between the smallest of the values $(n_h - 1)$
and their sum. The approximation takes account of the fact that $S_h^2$
may vary from stratum to stratum. It requires the assumption that the
$y_{hi}$ are normal, since it depends on the result that the variance of $s_h^2$ is
$2\sigma_h^4/(n_h - 1)$. As shown in formula 2.51, page 43, the variance of $s_h^2$ will
be larger than this if the distribution of $y_{hi}$ has positive kurtosis. In this
event, formula 5.15 overestimates the effective degrees of freedom.
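
Formula 5.15 is simple to evaluate directly. In the sketch below the function name is hypothetical, and the inputs are the two-stratum city example with 12 units per stratum, with the population variances from section 5.3 standing in for sample estimates $s_h^2$ purely for illustration.

```python
def effective_df(N_h, n_h, s2_h):
    """Formula 5.15: Satterthwaite's effective degrees of freedom for s(ybar_st)."""
    g = [Nh * (Nh - nh) / nh for Nh, nh in zip(N_h, n_h)]
    numerator = sum(gh * s2 for gh, s2 in zip(g, s2_h)) ** 2
    denominator = sum(gh**2 * s2**2 / (nh - 1) for gh, s2, nh in zip(g, s2_h, n_h))
    return numerator / denominator

N_h = [16, 48]
n_h = [12, 12]
s2_h = [53_843.0, 5_581.0]  # used here as if they were sample estimates

n_e = effective_df(N_h, n_h, s2_h)
print(n_e)  # lies between min(n_h - 1) = 11 and sum(n_h - 1) = 22
```

When one stratum dominates the sum $\sum g_h s_h^2$, $n_e$ moves toward that stratum's own $(n_h - 1)$.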


5.5 OPTIMUM ALLOCATION 


In stratified sampling the values of the sample sizes $n_h$ in the respective
strata are chosen by the sampler. They may be selected to minimize
$V(\bar{y}_{st})$ for a specified cost of taking the sample or to minimize the cost for a
specified value of $V(\bar{y}_{st})$.

The simplest cost function is of the form

$$\text{cost} = C = c_0 + \sum c_h n_h \tag{5.16}$$

Within any stratum the cost is proportional to the size of sample, but the
cost per unit $c_h$ may vary from stratum to stratum. The term $c_0$ represents
an overhead cost. This cost function is appropriate when the major item
of cost is that of taking the measurements on each unit. If travel costs
between units are substantial, empirical and mathematical studies suggest
that travel costs are better represented by the expression $\sum t_h\sqrt{n_h}$, where $t_h$
is the travel cost per unit [Beardwood et al. (1959)]. Only the linear cost
function (5.16) is considered here.
Theorem 5.6. In stratified random sampling with a cost function of the
form (5.16) the variance of the estimated mean $\bar{y}_{st}$ is a minimum when $n_h$
is proportional to $N_h S_h/\sqrt{c_h}$.
Proof. The problem is to minimize

$$V(\bar{y}_{st}) = \sum_{h=1}^{L}\frac{W_h^2 S_h^2}{n_h}(1 - f_h) = \sum_{h=1}^{L}\frac{W_h^2 S_h^2}{n_h} - \sum_{h=1}^{L}\frac{W_h^2 S_h^2}{N_h}$$

subject to the restriction

$$c_1 n_1 + c_2 n_2 + \cdots + c_L n_L = C - c_0$$

Using the calculus method of Lagrange multipliers, we select the $n_h$ and
the multiplier $\lambda$ to minimize

$$V(\bar{y}_{st}) + \lambda\left(\sum c_h n_h - C + c_0\right) = \sum\frac{W_h^2 S_h^2}{n_h} - \sum\frac{W_h^2 S_h^2}{N_h} + \lambda(c_1 n_1 + \cdots + c_L n_L - C + c_0)$$

Differentiation with respect to $n_h$ gives the equations

$$-\frac{W_h^2 S_h^2}{n_h^2} + \lambda c_h = 0 \qquad (h = 1, 2, \ldots, L)$$

that is,

$$n_h\sqrt{\lambda} = \frac{W_h S_h}{\sqrt{c_h}} \tag{5.17}$$

Summing over all strata, we obtain

$$n\sqrt{\lambda} = \sum\frac{W_h S_h}{\sqrt{c_h}} \tag{5.18}$$

Finally, the ratio of (5.17) to (5.18) gives

$$\frac{n_h}{n} = \frac{W_h S_h/\sqrt{c_h}}{\sum(W_h S_h/\sqrt{c_h})} = \frac{N_h S_h/\sqrt{c_h}}{\sum(N_h S_h/\sqrt{c_h})} \tag{5.19}$$

This theorem leads to the following rules of conduct. In a given stratum,
take a larger sample if

1. the stratum is larger,
2. the stratum is more variable internally,
3. sampling is cheaper in the stratum.


One further step is needed to complete the allocation. Equation (5.19)
gives the $n_h$ in terms of $n$, but we do not yet know what value $n$ has. The
solution depends on whether the sample is chosen to meet a specified total
cost $C$ or to give a specified variance $V$ for $\bar{y}_{st}$. If cost is fixed, substitute the
optimum values of $n_h$ in the cost function (5.16) and solve for $n$. This gives

$$n = \frac{(C - c_0)\sum(N_h S_h/\sqrt{c_h})}{\sum(N_h S_h\sqrt{c_h})}$$

If $V$ is fixed, substitute the optimum $n_h$ in the formula for $V(\bar{y}_{st})$. We find

$$n = \frac{\left(\sum W_h S_h\sqrt{c_h}\right)\sum(W_h S_h/\sqrt{c_h})}{V + (1/N)\sum W_h S_h^2}$$

where $W_h = N_h/N$.
An important special case arises if $c_h = c$, that is, if the cost per unit is
the same in all strata. The cost becomes $C = c_0 + cn$, and optimum allocation
for fixed cost reduces to optimum allocation for fixed sample size.
The result in this special case is as follows:


Theorem 5.7. In stratified random sampling $V(\bar{y}_{st})$ is minimized for a
fixed total size of sample $n$ if

$$n_h = n\frac{W_h S_h}{\sum W_h S_h} = n\frac{N_h S_h}{\sum N_h S_h} \tag{5.20}$$

This allocation is sometimes called the Neyman allocation, after Neyman
(1934), whose proof gave the result prominence. An earlier proof by
Tschuprow (1923) was later discovered.

A formula for the minimum variance with fixed $n$ is obtained by
substituting the value of $n_h$ in (5.20) into the general formula for $V(\bar{y}_{st})$. The
result is

$$V_{min}(\bar{y}_{st}) = \frac{\left(\sum W_h S_h\right)^2}{n} - \frac{\sum W_h S_h^2}{N} \tag{5.21}$$

The second term on the right represents the fpc.
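
The fixed-cost allocation can be sketched in a few lines. The function name and every planning value below are hypothetical; the code applies (5.19) together with the fixed-cost solution for $n$, and then sets all unit costs equal to show the reduction to the Neyman allocation (5.20).

```python
import math

def optimum_allocation_fixed_cost(N_h, S_h, c_h, C, c0):
    """Stratum sample sizes minimizing V(ybar_st) for total cost C = c0 + sum(c_h * n_h)."""
    a = [Nh * Sh / math.sqrt(ch) for Nh, Sh, ch in zip(N_h, S_h, c_h)]
    b = [Nh * Sh * math.sqrt(ch) for Nh, Sh, ch in zip(N_h, S_h, c_h)]
    n = (C - c0) * sum(a) / sum(b)         # total sample size for the fixed cost
    return [n * ah / sum(a) for ah in a]   # equation (5.19); values are not rounded

# Hypothetical planning values
N_h = [400, 600]
S_h = [10.0, 5.0]
c_h = [4.0, 1.0]       # stratum 1 is four times as costly per unit
C, c0 = 350.0, 100.0

n_h = optimum_allocation_fixed_cost(N_h, S_h, c_h, C, c0)
spent = sum(c * n for c, n in zip(c_h, n_h))
print(n_h, spent)      # the variable cost comes to C - c0 = 250

# With equal unit costs the allocation is proportional to N_h * S_h
neyman = optimum_allocation_fixed_cost(N_h, S_h, [1.0, 1.0], C, c0)
print(neyman)
```

The sizes returned are real numbers; rounding to integers, and the case where a computed $n_h$ exceeds $N_h$, are left aside here.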
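An alternative proof of the allocation results (Stuart, 1954) uses the
Cauchy-Schwarz inequality. Minimizing $V$ for fixed $C$ and minimizing $C$ for
fixed $V$ are both equivalent to minimizing the product

$$V'C' = \left(\sum\frac{W_h^2 S_h^2}{n_h}\right)\left(\sum c_h n_h\right)$$

since $V'$ and $C'$ are the parts of $V$ and $C$ that depend on the $n_h$. The
inequality states that if $a_h$, $b_h$ are two sets of positive numbers, then

$$\left(\sum a_h^2\right)\left(\sum b_h^2\right) \ge \left(\sum a_h b_h\right)^2 \tag{5.22}$$

the equality occurring only if $b_h/a_h$ is constant for all $h$. Take

$$a_h = \frac{W_h S_h}{\sqrt{n_h}}, \qquad b_h = \sqrt{c_h n_h}$$

The inequality (5.22) gives

$$V'C' = \left(\sum\frac{W_h^2 S_h^2}{n_h}\right)\left(\sum c_h n_h\right) = \left(\sum a_h^2\right)\left(\sum b_h^2\right) \ge \left(\sum W_h S_h\sqrt{c_h}\right)^2$$

The minimum value occurs when

$$\frac{b_h}{a_h} = \frac{n_h\sqrt{c_h}}{W_h S_h} = \text{constant}$$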


in agreement with theorem 5.6.


5.6 RELATIVE PRECISION OF STRATIFIED
RANDOM AND SIMPLE RANDOM SAMPLING


If intelligently used, stratification nearly always results in a smaller
variance for the estimated mean or total than is given by a comparable
simple random sample. It is not true, however, that any stratified random
sample gives a smaller variance than a simple random sample. If the values
of the $n_h$ are far from optimum, stratified sampling may have a higher
variance. In fact, even stratification with optimum allocation for fixed
total sample size may give a higher variance, though this result is an
academic curiosity rather than something likely to happen in practice.

In this section a comparison is made between simple random sampling
and stratified random sampling with proportional and optimum
allocation.* This comparison shows how the gain due to stratification is
achieved. The fpc is ignored.

The variances of the estimated means are denoted by $V_{ran}$, $V_{prop}$, and $V_{opt}$,
respectively.

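Theorem 5.8. If terms in $1/N_h$ are ignored,

$$V_{opt} \le V_{prop} \le V_{ran} \tag{5.23}$$

where the optimum allocation is for fixed $n$, that is, with $n_h \propto N_h S_h$.
Proof. If the fpc is ignored,

$$V_{ran} = \frac{S^2}{n} \tag{5.24}$$

$$V_{prop} = \frac{\sum W_h S_h^2}{n} \quad \text{[from equation (5.7), section 5.3]} \tag{5.25}$$

$$V_{opt} = \frac{\left(\sum W_h S_h\right)^2}{n} \quad \text{[from equation (5.21), section 5.5]} \tag{5.26}$$

From the standard algebraic identity for the analysis of variance of the
stratified population, we have

$$(N - 1)S^2 = \sum_h\sum_i (y_{hi} - \bar{Y})^2 = \sum_h\sum_i (y_{hi} - \bar{Y}_h)^2 + \sum_h N_h(\bar{Y}_h - \bar{Y})^2$$

$$= \sum_h (N_h - 1)S_h^2 + \sum_h N_h(\bar{Y}_h - \bar{Y})^2 \tag{5.27}$$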


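* Interesting discussions of this question are given by Armitage (1947) and Evans
(1951).

Since terms in $1/N_h$ are negligible, this may be written

$$NS^2 = \sum_h N_h S_h^2 + \sum_h N_h(\bar{Y}_h - \bar{Y})^2$$

Hence

$$V_{ran} = \frac{S^2}{n} = \frac{\sum N_h S_h^2}{nN} + \frac{\sum N_h(\bar{Y}_h - \bar{Y})^2}{nN} = V_{prop} + \frac{\sum N_h(\bar{Y}_h - \bar{Y})^2}{nN} \tag{5.28}$$

By the definition of $V_{opt}$, we must have $V_{prop} \ge V_{opt}$. Their difference is

$$V_{prop} - V_{opt} = \frac{1}{nN}\left[\sum N_h S_h^2 - \frac{\left(\sum N_h S_h\right)^2}{N}\right] = \frac{1}{nN}\sum N_h(S_h - \bar{S})^2 \tag{5.29}$$

where $\bar{S} = \sum N_h S_h/N$. From (5.29) and (5.28)

$$V_{ran} = V_{opt} + \frac{\sum N_h(S_h - \bar{S})^2}{nN} + \frac{\sum N_h(\bar{Y}_h - \bar{Y})^2}{nN} \tag{5.30}$$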


To summarize, there are two components to the decrease in variance as
we change from simple random sampling to optimum allocation. The
first component (term on the extreme right) comes from the elimination of
differences among the stratum means; the second (middle term on the
right) from elimination of the effect of differences among the stratum
standard deviations. The second component represents the difference in
variance between optimum and proportional allocation.
If the fpc cannot be neglected, the same type of analysis leads to the
result

$$V_{ran} = V_{prop} + \frac{N - n}{nN(N - 1)}\left[\sum N_h(\bar{Y}_h - \bar{Y})^2 - \frac{1}{N}\sum(N - N_h)S_h^2\right] \tag{5.31}$$

It follows that proportional stratification gives a higher variance than
simple random sampling if

$$\sum N_h(\bar{Y}_h - \bar{Y})^2 < \frac{1}{N}\sum(N - N_h)S_h^2 \tag{5.32}$$

Mathematically, this can happen. Suppose that the $S_h^2$ are all equal to
$S_w^2$, so that proportional allocation is optimum in the sense of Neyman.
Then (5.32) becomes

$$\sum N_h(\bar{Y}_h - \bar{Y})^2 < (L - 1)S_w^2$$

or

$$\frac{\sum N_h(\bar{Y}_h - \bar{Y})^2}{L - 1} < S_w^2$$

Those familiar with the analysis of variance will recognize this relation as
implying that the mean square among strata is smaller than the mean
square within strata, that is, that the F-ratio is less than 1.
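
A tiny numerical check may make (5.31) and (5.32) concrete. The two six-unit populations below are hypothetical and the function name is an assumption for this sketch: in the first the strata separate small values from large ones, and in the second the same values are split so that the stratum means are close together.

```python
from statistics import mean, variance

def compare_ran_prop(strata, n):
    """Return (V_ran, V_prop, right-hand side of (5.31)) for the estimated mean."""
    N_h = [len(s) for s in strata]
    N = sum(N_h)
    values = [y for s in strata for y in s]
    Ybar = mean(values)
    S2 = variance(values)                  # divisor N - 1
    S2_h = [variance(s) for s in strata]   # divisor N_h - 1
    fpc = (N - n) / N
    v_ran = fpc * S2 / n
    v_prop = fpc / n * sum(Nh / N * s2 for Nh, s2 in zip(N_h, S2_h))
    between = sum(Nh * (mean(s) - Ybar) ** 2 for Nh, s in zip(N_h, strata))
    within = sum((N - Nh) * s2 for Nh, s2 in zip(N_h, S2_h)) / N
    rhs = v_prop + (N - n) / (n * N * (N - 1)) * (between - within)
    return v_ran, v_prop, rhs

good = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]   # strata differ in mean
poor = [[1.0, 12.0, 3.0], [11.0, 2.0, 13.0]]   # stratum means nearly equal

for strata in (good, poor):
    v_ran, v_prop, rhs = compare_ran_prop(strata, n=2)
    print(v_ran, v_prop, rhs)
```

In the second population condition (5.32) holds, and proportional stratification gives the larger variance.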


5.7 WHEN DOES STRATIFICATION PRODUCE 
LARGE GAINS IN PRECISION? 


The ideal variate for stratification is the value of y itself, the quantity
to be measured in the survey. If we could stratify by the values of y, there
would be no overlap between strata, and the variance within strata would
be much smaller than the over-all variance, particularly if there were many
strata. This situation is illustrated by the example in section 5.3, page 92.
The population consisted of the sizes (numbers of inhabitants) of 64 cities
in 1930, stratified by size. Although there were only two strata, proportional
stratification reduced the s.e. of $\hat{Y}_{st}$ from 2365 to 1372. Stratification
with $n_1 = n_2 = 12$, which is optimum under Neyman allocation, produced
a further reduction to 1044.

In practice, of course, we cannot stratify by the values of y. But some 
important applications come close to this situation, and therefore give 
large gains in precision, by satisfying the following three conditions. 


1. The population is composed of institutions varying widely in size. 
2. The principal variables to be measured are closely related to the sizes 
of the institutions. 


3. A good measure of size is available for setting up the strata.


Examples are businesses of a specific kind, for example, groceries (in 
surveys dealing with the volume of business or number of employees), 
schools (in surveys related to numbers of pupils), hospitals (in studies of 
patient load), and income tax returns (for items highly correlated with 
taxable income). In the United States farms also vary greatly in size as
measured by total acreage or gross income, but common farm items, such
as the production of particular crops or types of livestock, often exhibit
only a moderate correlation with farm size, so that the gains from
stratification by farm size are not huge.

If the size of the institution remains stable through time, at least for
short periods, then its best practical measure is usually the size of the
institution on some recent occasion when a census was taken. The example
in section 5.3 illustrates the situation in which good previous data are
available. Table 5.2 shows the $S_h$ and the resulting optimum $n_h \propto N_h S_h$
when the allocation is made from 1920 and 1930 data, respectively.

The 1920 data indicate an n, of 11.56, as against a "true" optimum of 
12.21 for the 1930 data. When rounded to integers, both sets of data give 
the same allocation—a sample size of 12 from each stratum. 
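
The allocations quoted here follow from the Neyman rule $n_h = nN_hS_h/\sum N_hS_h$. The short sketch below recomputes them from the $S_h$ values of Table 5.2, with $N_1 = 16$, $N_2 = 48$, and $n = 24$; the function name is hypothetical.

```python
def neyman_allocation(N_h, S_h, n):
    """Equation (5.20): n_h proportional to N_h * S_h for a fixed total n."""
    products = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
    total = sum(products)
    return [n * p / total for p in products]

N_h = [16, 48]
S_1920 = [163.30, 58.55]   # S_h computed from the 1920 sizes
S_1930 = [232.04, 74.71]   # S_h computed from the 1930 sizes

alloc_1920 = neyman_allocation(N_h, S_1920, 24)
alloc_1930 = neyman_allocation(N_h, S_1930, 24)
print([round(x, 2) for x in alloc_1920])  # [11.56, 12.44]
print([round(x, 2) for x in alloc_1930])  # [12.21, 11.79]
```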

TABLE 5.2
CALCULATION OF THE OPTIMUM ALLOCATION

                             1920 Data                          1930 Data
    Stratum   $N_h$    $S_h$    $N_h S_h$   $n_h$        $S_h$    $N_h S_h$   $n_h$

       1       16     163.30    2612.80    11.56        232.04    3712.64    12.21
       2       48      58.55    2810.40    12.44         74.71    3586.08    11.79

    Totals                      5423.20    24.00                  7298.72    24.00


Note that the optimum sampling fraction is 75% in stratum 1 but only
25% in stratum 2. It is often found that because of the high variability of
the stratum consisting of the largest institutions the formula calls for 100%
sampling in this stratum. Indeed, the allocation may call for more than
100% sampling (see section 5.8). Note also that the $S_h$ are smaller in 1920
than in 1930. The 1920 data give an overoptimistic impression of the
precision to be obtained in a 1930 survey. As mentioned in section 4.6, the
possibility of a change in the levels of the $S_h$ should always be considered
when using past data, even though an allowance for change may have to
be something of a guess.
Geographic stratification, in which the strata are compact areas such as
counties or neighborhoods in a city, is common, often for administrative
convenience or because separate data are wanted for each stratum.
It is usually accompanied by some increase in precision because many
factors operate to make people living or crops growing in the same area
show similarities in their principal characteristics. The gains from
geographic stratification, however, are generally modest. For example,
Table 5.3 shows data published by Jessen (1942) and Jessen and Houseman
(1944) on the effectiveness of geographic stratification for a number of
typical farm economic items.

Four sizes of stratum are represented: the township, the county, the
"type of farming" area, and the state. To give some idea of the relative
sizes of the strata, there are about 1600 townships, 100 counties, and 5 areas
in Iowa.
In the table the precision of a method of stratification is taken as
inversely proportional to the value of $V(\bar{y}_{st})$ given by the method. Thus the
relative precision of method 1 to method 2 is the ratio $V_2(\bar{y}_{st})/V_1(\bar{y}_{st})$,
expressed as a percentage. The data shown are averages over the numbers
of items given in the second column. The county is taken as a standard in
each case. As indicated, the gains in precision are moderate. In Iowa the
use of 1600 strata (townships) compared with no stratification (state)
increases the precision by about 30%; that is, it reduces the variance by
about 25%.


TABLE 5.3
RELATIVE PRECISION OF DIFFERENT KINDS OF GEOGRAPHIC
STRATIFICATION (IN PER CENT)

Columns: State; No. of Items; relative precision by stratum
(Township, County, Type of Farming Area, State)

Iowa, 1938 18 115 100 96 91
Iowa, 1939 19 121 100 97 91
Florida, 1942
Citrus fruit area 14 144 100
Truck farming area 15 111 100
California, 1942 17 113 100 97


As regards proportional versus optimum stratification, there are two
situations in which optimum stratification wins handsomely. The first is
the case, already discussed, in which the population consists of large and
small institutions, stratified by some measure of size. The variances $S_h^2$ are
usually much greater for the large institutions than for the small, making
proportional stratification inefficient. The second situation is found in
surveys in which some strata are much more expensive to sample than
others. The influence of the factor $\sqrt{c_h}$ may make proportional allocation
poor.

When planning an allocation in which the estimated $n_h$ do not differ
greatly from proportionality, it is worthwhile to estimate how much larger
$V(\bar{y}_{st})$ or $V(\hat{Y}_{st})$ become if proportional allocation is used. The optima in
the allocation problem are rather flat (see section 5A.1) and the increase
in variance may turn out surprisingly small. Moreover, the superiority of
the optimum, as computed from estimated values of the $S_h$, is always
exaggerated because of the errors in the estimated $S_h$. The simplicity and the
self-weighting feature of proportional allocation are probably worth a
10-to-20% increase in variance.


5.8 ALLOCATION REQUIRING MORE THAN 100
PER CENT SAMPLING


As mentioned in section 5.7, the formula for the optimum may produce
an $n_h$ in some stratum that is larger than the corresponding $N_h$. Consider
the example on city sizes in section 5.3. A sample of 24 cities, distributed 
between two strata, called for 12 cities out of 16 in the first stratum and 12 
out of 48 in the second. Had the sample size been 48, the allocation would 
demand 24 cities out of 16 in the first stratum. The best that can be done is 
to take all cities in the stratum, leaving 32 cities for the second stratum 
instead of the 24 postulated by the formula. This problem arises only when 
the over-all sampling fraction is substantial and one stratum is much 
more variable than the others. It has occurred in practice on several 
occasions. 
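
The adjustment described above can be sketched in a few lines of Python. This is a minimal illustration rather than a procedure taken from the text: the function name `neyman_with_cap` is hypothetical, the $S_h$ are assumed to be positive, and the inputs are the 1930 figures of Table 5.2, used here only as an assumed stand-in for the city example. Any stratum whose computed $n_h$ exceeds $N_h$ is taken in full, and the remaining sample is reallocated among the other strata.

```python
def neyman_with_cap(N, S, n):
    """Allocate n_h proportional to N_h * S_h; a stratum whose share
    exceeds N_h is sampled completely and the rest is reallocated."""
    L = len(N)
    alloc = [0.0] * L
    full = set()                      # strata taken at 100%
    while True:
        free = [h for h in range(L) if h not in full]
        n_left = n - sum(N[h] for h in full)
        total = sum(N[h] * S[h] for h in free)
        for h in free:
            alloc[h] = n_left * N[h] * S[h] / total
        over = [h for h in free if alloc[h] > N[h]]
        if not over:
            break
        for h in over:
            alloc[h] = float(N[h])
            full.add(h)
    return alloc


# Assumed inputs: N_h and S_h as in the 1930 columns of Table 5.2.
N_h = [16, 48]
S_h = [232.04, 74.71]
alloc_24 = neyman_with_cap(N_h, S_h, 24)   # no stratum is oversampled
alloc_48 = neyman_with_cap(N_h, S_h, 48)   # stratum 1 is capped at 16
```

With n = 24 the sketch gives roughly 12 cities in each stratum; with n = 48 it takes all 16 cities in stratum 1 and leaves 32 for stratum 2.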

Care must be taken to use the correct formula in predicting the expected
variance from this allocation or in comparing the allocation with others.
Formula 5.5 in section 5.3 is appropriate if the $n_h$ given by the revised
optimum allocation are substituted. Formula 5.21 for the minimum
variance for fixed n,

$$V_{min}(\bar{y}_{st}) = \frac{(\sum W_h S_h)^2}{n} - \frac{\sum W_h S_h^2}{N}$$

is no longer correct. If stratum 1 is the only stratum in which oversampling
is indicated, the correct formula for $V_{min}$ becomes

$$V_{min}(\bar{y}_{st}) = \frac{(\sum' W_h S_h)^2}{n - N_1} - \frac{\sum' W_h S_h^2}{N}$$

where $\sum'$ denotes summation over all strata except stratum 1, and $n - N_1$
is the sample size remaining for those strata.
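
A short derivation may make the corrected formula easier to check. It is a sketch under two assumptions: stratum 1 is sampled completely ($n_1 = N_1$), and the remaining $n - N_1$ units are allocated to the other strata in proportion to $W_h S_h$. The symbol $\sum'$ denotes summation over all strata except stratum 1, as in the text.

```latex
% General variance for any allocation (formula 5.5):
V(\bar{y}_{st}) = \sum_h \frac{W_h^2 S_h^2}{n_h} - \frac{1}{N}\sum_h W_h S_h^2

% With n_1 = N_1 the stratum-1 terms cancel, because W_1/N_1 = 1/N:
\frac{W_1^2 S_1^2}{N_1} - \frac{W_1 S_1^2}{N} = 0
\quad\Longrightarrow\quad
V(\bar{y}_{st}) = {\sum_h}' \frac{W_h^2 S_h^2}{n_h} - \frac{1}{N}{\sum_h}' W_h S_h^2

% Substituting n_h = (n - N_1) W_h S_h / \sum' W_h S_h for h \ne 1:
V_{min}(\bar{y}_{st}) = \frac{\left({\sum}' W_h S_h\right)^2}{n - N_1}
                      - \frac{{\sum}' W_h S_h^2}{N}
```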


5.9 ESTIMATION OF SAMPLE SIZE WITH 
CONTINUOUS DATA 


Formulas for the determination of n under an estimated optimum
allocation were given in section 5.5. The present section presents formulas for
any allocation, with some useful special cases. It is assumed that the
estimate has a specified variance V. If, instead, the margin of error d (section
4.4) has been specified, $V = (d/t)^2$, where t is the normal deviate
corresponding to the allowable probability that the error will exceed the desired
margin.

Estimation of the Population Mean $\bar{Y}$


Let $s_h$ be the estimate of $S_h$ and let $n_h = w_h n$, where the $w_h$ have been
chosen. In these terms the anticipated $V(\bar{y}_{st})$ (from theorem 5.3, section
5.3) is

$$V = \frac{1}{n}\sum \frac{W_h^2 s_h^2}{w_h} - \frac{1}{N}\sum W_h s_h^2 \tag{5.33}$$

with $W_h = N_h/N$. This gives, as a general formula for n,

$$n = \frac{\sum W_h^2 s_h^2 / w_h}{V + \frac{1}{N}\sum W_h s_h^2} \tag{5.34}$$

If the fpc is ignored, we have, as a first approximation,

$$n_0 = \frac{1}{V}\sum \frac{W_h^2 s_h^2}{w_h} \tag{5.35}$$

If $n_0/N$ is not negligible, we may calculate n as

$$n = \frac{n_0}{1 + \frac{1}{NV}\sum W_h s_h^2} \tag{5.36}$$

In particular cases the formulas take various forms that may be more
convenient for computation. A few are given.

Presumed optimum allocation (for fixed n): $w_h \propto W_h s_h$.

$$n = \frac{(\sum W_h s_h)^2}{V + \frac{1}{N}\sum W_h s_h^2} \tag{5.37}$$

Proportional allocation: $w_h = W_h = N_h/N$.

$$n_0 = \frac{\sum W_h s_h^2}{V}, \qquad n = \frac{n_0}{1 + n_0/N} \tag{5.38}$$

Estimation of the Population Total 


If V is the desired $V(\hat{Y}_{st})$, the principal formulas are as follows:

General:

$$n = \frac{\sum N_h^2 s_h^2 / w_h}{V + \sum N_h s_h^2} \tag{5.39}$$

Presumed optimum (for fixed n):

$$n = \frac{(\sum N_h s_h)^2}{V + \sum N_h s_h^2} \tag{5.40}$$

Proportional:

$$n_0 = \frac{N}{V}\sum N_h s_h^2, \qquad n = \frac{n_0}{1 + n_0/N} \tag{5.41}$$


Example. This example comes from a paper by Cornell (1947), which 
describes a sample of United States colleges and universities drawn in 1946 by 
the U.S. Office of Education in order to estimate enrollments for the 1946-1947 
academic year. The illustration is for the population of 196 teachers’ colleges 
and normal schools. These were arranged in seven strata, of which one small 
stratum will be ignored. The first five strata were constructed by size of
institution; the sixth contained colleges for women only. Estimates $s_h$ of the $S_h$ were
computed from results for the 1943-1944 academic year. An "optimum"
stratification based on these $s_h$ was employed.

The objective was a coefficient of variation of 5% in the estimated total 
enrollment. In 1943 the total enrollment for this group of colleges was 56,472. 
Thus the desired standard error is 


(0.05)(56,472) = 2824


so that the desired variance is 
V = (2824)? = 7,974,976 


It may be objected that enrollments will be greater in 1946 than in 1943 and 
that allowance should be made for this increase. Actually, the calculation 
assumes only that the cv per college remains the same in 1943 and 1946—an 
assumption that may not be unreasonable. 

Table 5.4 shows the values of $N_h$, $s_h$, and $N_h s_h$, which were known before
determining n.

TABLE 5.4
DATA FOR ESTIMATING SAMPLE SIZE

Stratum    N_h    s_h    N_h s_h    n_h
   1        13    325     4,225       9
   2        18    190     3,420       7
   3        26    189     4,914      10
   4        42     82     3,444       7
   5        73     86     6,278      13
   6        24    190     4,560      10
Totals     196           26,841      56

The appropriate formula for n is (5.40), which applies to an "optimum"
allocation for estimating a total. With only 196 units in this population, it is
improbable that the fpc will be negligible. However, for purposes of illustration,
a first approximation ignoring the fpc will be sought. This is

$$n_0 = \frac{(\sum N_h s_h)^2}{V} = \frac{(26{,}841)^2}{7{,}974{,}976} = 90.34$$

Adjustment is obviously needed. For the correct n in (5.40), we have

$$n = \frac{n_0}{1 + \frac{1}{V}\sum N_h s_h^2} = \frac{90.34}{1 + \dfrac{4{,}640{,}387}{7{,}974{,}976}} = 57.1$$

A sample size of 56 was chosen.* The $n_h$ for individual strata appear in the
right-hand column of Table 5.4.
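
The arithmetic of this example can be retraced with a short Python sketch. The stratum figures are those of Table 5.4 and the formulas are (5.40) with and without the fpc; the variable names are illustrative choices, not notation from the text.

```python
# Stratum sizes and advance standard deviations from Table 5.4.
N_h = [13, 18, 26, 42, 73, 24]
s_h = [325, 190, 189, 82, 86, 190]

V = 2824 ** 2                                        # desired variance of the total

sum_Ns = sum(N * s for N, s in zip(N_h, s_h))        # sum of N_h * s_h
sum_Ns2 = sum(N * s * s for N, s in zip(N_h, s_h))   # sum of N_h * s_h^2

n0 = sum_Ns ** 2 / V               # first approximation, fpc ignored
n = n0 / (1 + sum_Ns2 / V)         # corrected value from (5.40)

# Allocation of the chosen n = 56 in proportion to N_h * s_h.
n_alloc = [56 * N * s / sum_Ns for N, s in zip(N_h, s_h)]
n_rounded = [round(x) for x in n_alloc]
```

Rounding the allocated values to integers reproduces the right-hand column of Table 5.4.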


5.10 STRATIFIED SAMPLING FOR PROPORTIONS


If we wish to estimate the proportion of units in the population that fall 
into some defined class C, the ideal stratification is attained if we can place 
in the first stratum every unit that falls in C, and in the second every unit 
that does not. Failing this, we try to construct strata such that the
proportion in class C varies as much as possible from stratum to stratum.
Let

$$P_h = \frac{A_h}{N_h}, \qquad p_h = \frac{a_h}{n_h}$$

be the proportions of units in C in the hth stratum and in the sample from
that stratum, respectively. For the proportion in the whole population, the
estimate appropriate to stratified random sampling is

$$p_{st} = \sum \frac{N_h p_h}{N} \tag{5.42}$$

Theorem 5.9. With stratified random sampling, the variance of $p_{st}$ is

$$V(p_{st}) = \frac{1}{N^2}\sum \frac{N_h^2 (N_h - n_h)}{N_h - 1}\,\frac{P_h Q_h}{n_h} \tag{5.43}$$


Proof. This is a particular case of the general theorem for the variance
of the estimated mean. From theorem 5.3

$$V(\bar{y}_{st}) = \frac{1}{N^2}\sum N_h (N_h - n_h)\frac{S_h^2}{n_h}$$

Let $y_{hi}$ be a variate which has the value 1 when the unit is in C, and zero
otherwise. In section 3.2, equation 3.4, it was shown that for this variate

$$S_h^2 = \frac{N_h}{N_h - 1} P_h Q_h$$

This gives the result.

* The arithmetical results differ slightly from those given by Cornell (1947).

Note. In nearly all applications, even if the fpc is not negligible, terms
in $1/N_h$ will be negligible, and the slightly simpler formula

$$V(p_{st}) = \frac{1}{N^2}\sum N_h (N_h - n_h)\frac{P_h Q_h}{n_h} = \sum \frac{W_h^2 P_h Q_h}{n_h}(1 - f_h) \tag{5.44}$$

can be used.
Corollary 1. When the fpc can be ignored,

$$V(p_{st}) = \sum \frac{W_h^2 P_h Q_h}{n_h} \tag{5.45}$$

Corollary 2. With proportional allocation,

$$V(p_{st}) = \frac{N - n}{N}\,\frac{1}{nN}\sum \frac{N_h^2 P_h Q_h}{N_h - 1} \tag{5.46}$$

$$\doteq \frac{1 - f}{n}\sum W_h P_h Q_h \tag{5.47}$$

For a sample estimate of the variance, substitute $p_h q_h/(n_h - 1)$ for the
unknown $P_h Q_h/n_h$ in any of the formulas above.

The best choice of the $n_h$ in order to minimize $V(p_{st})$ follows from the
general theory in section 5.5.
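Minimum Variance for Fixed Total Sample Size.

$$n_h \propto N_h\sqrt{\frac{N_h}{N_h - 1}}\sqrt{P_h Q_h} \doteq N_h\sqrt{P_h Q_h}$$

Thus

$$n_h \doteq n\,\frac{N_h\sqrt{P_h Q_h}}{\sum N_h\sqrt{P_h Q_h}} \tag{5.48}$$

Minimum Variance for Fixed Cost, where Cost $= c_0 + \sum c_h n_h$.

$$n_h \doteq n\,\frac{N_h\sqrt{P_h Q_h/c_h}}{\sum N_h\sqrt{P_h Q_h/c_h}} \tag{5.49}$$

The value of n is found as in section 5.5.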
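
As a rough illustration of (5.43), (5.48), and (5.49), the following Python sketch computes the variance of $p_{st}$ for a given allocation and the approximate optimum allocation. The function names and the two-stratum population are hypothetical and serve only to show the calculation.

```python
from math import sqrt


def var_pst(N_h, n_h, P_h):
    """Variance of p_st from (5.43), including the fpc."""
    N = sum(N_h)
    total = sum(Nh**2 * (Nh - nh) / (Nh - 1) * P * (1 - P) / nh
                for Nh, nh, P in zip(N_h, n_h, P_h))
    return total / N**2


def optimum_alloc(N_h, P_h, n, c_h=None):
    """Approximate allocation (5.48); if unit costs c_h are given, (5.49)."""
    if c_h is None:
        c_h = [1.0] * len(N_h)
    k = [Nh * sqrt(P * (1 - P) / c) for Nh, P, c in zip(N_h, P_h, c_h)]
    return [n * kh / sum(k) for kh in k]


# Hypothetical population: two strata of 1000 units each.
N_h = [1000, 1000]
P_h = [0.5, 0.1]
n = 80

alloc_opt = optimum_alloc(N_h, P_h, n)      # close to 50 and 30
alloc_prop = [n * Nh / sum(N_h) for Nh in N_h]

v_opt = var_pst(N_h, alloc_opt, P_h)
v_prop = var_pst(N_h, alloc_prop, P_h)
```

In this made-up case the optimum allocation gives a somewhat smaller variance than proportional allocation, since the two $P_h$ differ considerably.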


5.11 GAINS IN PRECISION IN STRATIFIED
SAMPLING FOR PROPORTIONS


If the costs per unit are the same in all strata, two useful working rules
are that (a) the gain in precision from stratified random over simple random
sampling is small or modest unless the $P_h$ vary greatly from stratum to
stratum; (b) optimum allocation for fixed n gains little over proportional
allocation if all $P_h$ lie between 0.1 and 0.9.

To illustrate the first result, Table 5.5 compares stratified random
sampling (proportional allocation) with simple random sampling for three
strata of equal sizes ($W_h = 1/3$). Four cases are included, the first having
$P_h$ = 0.4, 0.5, and 0.6 in the three strata and the last (and most extreme)
having $P_h$ = 0.1, 0.5, and 0.9. The first two columns show the variances of
the estimated proportion, multiplied by $n/(1 - f)$, and the last gives the
relative precisions of stratified to simple random sampling. The gain in
precision is large only in the last two cases.

TABLE 5.5
RELATIVE PRECISION OF STRATIFIED AND SIMPLE RANDOM SAMPLING

                      Simple              Stratified
                  nV(p)/(1 - f)        nV(p_st)/(1 - f)         Relative
     P_h              = PQ            = (1/3) Σ P_h Q_h       Precision (%)
0.4, 0.5, 0.6         2500                 2433                   103
0.3, 0.5, 0.7         2500                 2233                   112
0.2, 0.5, 0.8         2500                 1900                   132
0.1, 0.5, 0.9         2500                 1433                   174
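
The entries of Table 5.5 can be recomputed with a few lines of Python. The sketch assumes three strata of equal size and proportional allocation, as stated in the text; the function name `table_row` is an illustrative choice. The tabulated variances correspond to proportions expressed in per cent, so the values computed from proportions are multiplied by 10,000 for comparison.

```python
def table_row(P_h):
    """Return (simple, stratified, relative precision in %) for strata of
    equal size; both variances are multiplied by n / (1 - f)."""
    W = 1.0 / len(P_h)
    P = sum(W * p for p in P_h)                      # overall proportion
    simple = P * (1 - P)                             # PQ
    stratified = sum(W * p * (1 - p) for p in P_h)   # sum of W_h P_h Q_h
    return simple, stratified, 100 * simple / stratified


cases = [(0.4, 0.5, 0.6), (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)]
rows = [table_row(c) for c in cases]

stratified_pct2 = [round(1e4 * r[1]) for r in rows]  # in (per cent)^2 units
relative_precision = [round(r[2]) for r in rows]
```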

To compare proportional with optimum allocation for fixed n, it will be
found that if the fpc is ignored

$$V_{opt} = \frac{(\sum W_h\sqrt{P_h Q_h})^2}{n}, \qquad V_{prop} = \frac{\sum W_h P_h Q_h}{n}$$

The relative precision of proportional to optimum allocation is therefore

$$\frac{V_{opt}}{V_{prop}} = \frac{(\sum W_h\sqrt{P_h Q_h})^2}{\sum W_h P_h Q_h}$$

If all $P_h$ lie between the two values $P_0$ and $(1 - P_0)$, we are interested in
the smallest value the relative precision will take. For simplicity, we
consider two strata of equal size ($W_1 = W_2$). The minimum relative precision
is attained when $P_1 = \frac{1}{2}$ and $P_2 = P_0$. The relative precision then becomes

$$\frac{V_{opt}}{V_{prop}} = \frac{(0.5 + \sqrt{P_0 Q_0})^2}{2(0.25 + P_0 Q_0)}$$

Some values of this function are given in Table 5.6. Even with $P_0$ equal
to 0.1, or as high as 0.9, the relative precision is 94%. In most cases the
simplicity and the self-weighting feature of proportional stratification more
than compensate for this slight loss in precision.

TABLE 5.6
RELATIVE PRECISION OF PROPORTIONAL TO OPTIMUM ALLOCATION

P_0       0.4 or 0.6   0.3 or 0.7   0.2 or 0.8   0.1 or 0.9   0.05 or 0.95
RP (%)      100.0         99.8         98.8         94.1          86.6

The limitations of the example should be noted. It does not take
account of differential costs of sampling in different strata. In some surveys
the $P_h$ are very small, but they range from, say, 0.001 to 0.05 in different
strata. Here there would be a more substantial gain from optimum
stratification.
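
The function tabulated in Table 5.6 is easy to evaluate directly. The Python sketch below assumes the two-stratum case described in the text ($W_1 = W_2$, $P_1 = 1/2$, $P_2 = P_0$); the function name `rp_prop_to_opt` is an illustrative choice.

```python
from math import sqrt


def rp_prop_to_opt(P0):
    """Relative precision (%) of proportional to optimum allocation for
    two equal strata with P_1 = 0.5 and P_2 = P0, fpc ignored."""
    pq = P0 * (1 - P0)
    return 100 * (0.5 + sqrt(pq)) ** 2 / (2 * (0.25 + pq))


table_5_6 = [round(rp_prop_to_opt(p), 1) for p in (0.4, 0.3, 0.2, 0.1, 0.05)]
```

Because the function depends on $P_0$ only through $P_0 Q_0$, the value for $P_0$ is the same as the value for $1 - P_0$, which is why each column of the table carries two labels.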


5.12 ESTIMATION OF SAMPLE SIZE WITH 
PROPORTIONS 


Formulas can be deduced from the more general formulas in section
5.9. Let V be the desired variance in the estimate of the proportion P
for the whole population. The formulas for the two principal types of
allocation are as follows:

Proportional:

$$n_0 = \frac{\sum W_h p_h q_h}{V}, \qquad n = \frac{n_0}{1 + n_0/N} \tag{5.50}$$

Presumed optimum:

$$n_0 = \frac{(\sum W_h\sqrt{p_h q_h})^2}{V}, \qquad n = \frac{n_0}{1 + \frac{1}{NV}\sum W_h p_h q_h} \tag{5.51}$$

where $n_0$ is the first approximation, which ignores the fpc, and n is the
corrected value taking account of the fpc. In the development of these
formulas, the factors $N_h/(N_h - 1)$ have been taken as unity.

These results apply to the estimate of a proportion. If it is preferable to
think in terms of percentages, the same formulas apply if $p_h$, $q_h$, V, etc.,
are expressed as percentages. For the estimation of the total number in the
population in class C, that is, of NP, all variances are multiplied by $N^2$.


EXERCISES


5.1 In a population with N = 6 and L = 2 the values of $y_{hi}$ are 0, 1, 2 in
stratum 1 and 4, 6, 11 in stratum 2. A sample with n = 4 is to be taken. (a) Show
that the optimum $n_h$ under Neyman allocation, when rounded to integers, are
$n_h = 1$ in stratum 1 and $n_h = 3$ in stratum 2. (b) Compute the estimate $\bar{y}_{st}$
for every possible sample that can be drawn under optimum allocation and
under proportional allocation. Verify that the estimates are unbiased. Hence
find $V_{opt}(\bar{y}_{st})$ and $V_{prop}(\bar{y}_{st})$ directly. (c) Verify that $V_{opt}(\bar{y}_{st})$ agrees with the
formula given in equation 5.5 and that $V_{prop}(\bar{y}_{st})$ agrees with the formula given
in equation 5.7, page 91. (d) Use of formula 5.21, page 97, to compute $V_{opt}(\bar{y}_{st})$
is slightly incorrect because it does not allow for the fact that the $n_h$ were
rounded to integers. How well does it agree with the correct value?

5.2 The households in a town are to be sampled in order to estimate the 
average amount of assets per household that are readily convertible into cash. 
The households are stratified into a high-rent and a low-rent stratum. A house 
in the high-rent stratum is thought to have about nine times as much assets as one 
in the low-rent stratum, and $S_h$ is expected to be proportional to the square root
of the stratum mean.

There are 4000 households in the high-rent stratum and 20,000 in the low-rent
stratum. (a) How would you distribute a sample of 1000 households between the
two strata? (b) If the object is to estimate the difference between assets per
household in the two strata, how should the sample be distributed?

5.3 The following data show the stratification of all the farms in a county
by farm size and the average acres of corn (maize) per farm in each stratum.

                 Number of      Average       Standard
Farm Size          Farms      Corn Acres     Deviation
 (acres)            N_h       \bar{Y}_h         S_h
0-40                394           5.4            8.3
41-80               461          16.3           13.3
81-120              391          24.3           15.1
121-160             334          34.5           19.8
161-200             169          42.1           24.5
201-240             113          50.1           26.0
241-                148          63.8           35.2
Total or mean      2010          26.3

For a sample of 100 farms, compute the sample sizes in each stratum under
(a) proportional allocation, (b) optimum allocation. Compare the precisions
of these methods with that of simple random sampling.


5.4 Prove the result stated in formula 5.31, section 5.6:

$$V_{ran} = V_{prop} + \frac{(N - n)}{n(N - 1)N}\left[\sum N_h(\bar{Y}_h - \bar{Y})^2 - \frac{1}{N}\sum (N - N_h)S_h^2\right]$$

5.5 A sampler has two strata with relative sizes $W_1$, $W_2$. He believes that
$S_1$, $S_2$ can be taken as equal but thinks that $c_2$ may be between $2c_1$ and $4c_1$. He
would prefer to use proportional allocation but does not wish to incur a
substantial increase in variance compared with optimum allocation. For a given
cost $C = c_1 n_1 + c_2 n_2$, ignoring the fpc, show that

$$\frac{V_{prop}(\bar{y}_{st})}{V_{opt}(\bar{y}_{st})} = \frac{W_1 c_1 + W_2 c_2}{(W_1\sqrt{c_1} + W_2\sqrt{c_2})^2}$$

If $W_1 = W_2$, compute the relative increases in variance from using proportional
allocation when $c_2/c_1 = 2, 4$.

5.6 A sampler proposes to take a stratified random sample. He expects
that his field costs will be of the form $\sum c_h n_h$. His advance estimates of relevant
quantities for the two strata are as follows:

Stratum    W_h    S_h    c_h
   1       0.4    10     $4
   2       0.6    20     $9

(a) Find the values of $n_1/n$ and $n_2/n$ that minimize the total field cost for a given
value of $V(\bar{y}_{st})$. (b) Find the sample size required, under this optimum allocation,
to make $V(\bar{y}_{st}) = 1$. Ignore the fpc. (c) How much will the total field cost be?

5.7 After the sample in exercise 5.6 is taken, the sampler finds that his field
costs were actually $2 per unit in stratum 1 and $12 in stratum 2. (a) How much
greater is the field cost than anticipated? (b) If he had known the correct field
costs in advance, could he have attained $V(\bar{y}_{st}) = 1$ for the original estimated
field cost in exercise 5.6? (Hint. The Cauchy-Schwarz inequality, page 97,
with V = 1, gives the answer to this question without finding the new allocation.)

5.8 In a stratification with two strata, the values of the $W_h$ and $S_h$ are as
follows:

Stratum    W_h    S_h
   1       0.8     2
   2       0.2     4

Compute the sample sizes $n_1$, $n_2$ in the two strata needed to satisfy the following
conditions. Each case requires a separate computation. (Ignore the fpc.)
(a) The standard error of the estimated population mean $\bar{y}_{st}$ is to be 0.1 and the
total sample size $n = n_1 + n_2$ is to be minimized. (b) The standard error of
the estimated mean of each stratum is to be 0.1. (c) The standard error of the
difference between the two estimated stratum means is to be 0.1, again minimizing
the total size of sample.

5.9 With two strata, a sampler would like to have $n_1 = n_2$ for administrative
convenience, instead of using the values given by the Neyman allocation. If
$V(\bar{y}_{st})$, $V_{opt}(\bar{y}_{st})$ denote the variances given by the $n_1 = n_2$ and the Neyman
allocations, respectively, show that the fractional increase in variance

$$\frac{V(\bar{y}_{st}) - V_{opt}(\bar{y}_{st})}{V_{opt}(\bar{y}_{st})} = \left(\frac{r - 1}{r + 1}\right)^2$$

where $r = n_1/n_2$ as given by Neyman allocation. For the strata in exercise 5.8,
case a, what would the fractional increase in variance be by using $n_1 = n_2$ instead
of the optimum?

5.10 If the cost function is of the form $C = c_0 + \sum t_h\sqrt{n_h}$, where $c_0$ and the
$t_h$ are known numbers, show that in order to minimize $V(\bar{y}_{st})$ for fixed total cost
$n_h$ must be proportional to

$$\left(\frac{W_h^2 S_h^2}{t_h}\right)^{2/3}$$

Find the $n_h$ for a sample of size 1000 under the following conditions:

Stratum    W_h    S_h    t_h
   1       0.4     4      1
   2       0.3     5      2
   3       0.2     6      4


5.11 If $V_{prop}(\bar{y}_{st})$ is the variance of the estimated mean from a stratified
random sample of size n with proportional allocation and $V(\bar{y})$ is the variance of
the mean of a simple random sample of size n, show that the ratio

$$\frac{V_{prop}(\bar{y}_{st})}{V(\bar{y})}$$

does not depend on the size of sample but that the ratio

$$\frac{V_{min}(\bar{y}_{st})}{V_{prop}(\bar{y}_{st})}$$

decreases as n increases. (This implies that optimum allocation for fixed n
becomes more effective in relation to proportional allocation as n increases.)
(Use formulas 5.7 and 5.21.)


5.12 Compare the values obtained for $V(p_{st})$ under proportional allocation
and optimum allocation for fixed sample size in the following two populations.
Each stratum is of equal size. The fpc may be ignored.

   Population 1          Population 2
Stratum    P_h        Stratum    P_h
   1       0.1           1       0.01
   2       0.5           2       0.05
   3       0.9           3       0.10

What general result is illustrated by these two populations?
5.13 Show that in the estimation of proportions the results corresponding to
theorem 5.8 are as follows:

$$V_{ran} = V_{prop} + \frac{\sum W_h (P_h - P)^2}{n}$$

$$V_{prop} = V_{opt} + \frac{\sum W_h \left(\sqrt{P_h Q_h} - \overline{\sqrt{P_h Q_h}}\right)^2}{n}$$

where $\overline{\sqrt{P_h Q_h}} = \sum W_h \sqrt{P_h Q_h}$.


5.14 In a firm, 62% of the employees are skilled or unskilled males, 31%
are clerical females, and 7% are supervisory. From a sample of 400 employees
the firm wishes to estimate the proportion that uses certain recreational facilities.
Rough guesses are that the facilities are used by 40 to 50% of the males, 20 to
30% of the females, and 5 to 10% of the supervisors. (a) How would you
allocate the sample among the three groups? (b) If the true proportions of users
were 48, 21, and 4%, respectively, what would the s.e. of the estimated proportion
$p_{st}$ be? (c) What would the s.e. of p be from a simple random sample with n =
400?


CHAPTER 5A


Further Aspects 
of Stratified Sampling 


5A.1 EFFECTS OF DEVIATIONS FROM THE OPTIMUM 
ALLOCATION 


The following sections discuss a number of special topics in the practical 
use of stratified sampling. Sections 5A.1 to 5A.7 deal with problems that 
may come up in the planning of the sample, and sections 5A.8 to 5A.12 
with techniques of analysis of results, including short cuts in the compu- 
tation of standard errors. Finally, an introductory account is given of 
some useful results when the data are taken for analytical purposes 
(section 5A.13). The present section considers the loss in precision by
failure to achieve an optimum allocation of the sample. 

Suppose that it is intended to use optimum allocation for given n. The
sample size $n_h'$ in stratum h should be

$$n_h' = \frac{n(W_h S_h)}{\sum W_h S_h} \tag{5A.1}$$

From equation 5.21, page 97, the resulting minimum variance is

$$V_{min}(\bar{y}_{st}) = \frac{1}{n}\left(\sum W_h S_h\right)^2 - \frac{1}{N}\sum W_h S_h^2 \tag{5A.2}$$

In practice, since the $S_h$ are not known, we can only approximate this
allocation. If $\hat{n}_h$ is the sample size used in stratum h, the variance actually
attained, from equation 5.5, page 91, is

$$V(\bar{y}_{st}) = \sum \frac{W_h^2 S_h^2}{\hat{n}_h} - \frac{1}{N}\sum W_h S_h^2 \tag{5A.3}$$

The increase in variance caused by the imperfect allocation is

$$V(\bar{y}_{st}) - V_{min}(\bar{y}_{st}) = \sum \frac{W_h^2 S_h^2}{\hat{n}_h} - \frac{1}{n}\left(\sum W_h S_h\right)^2$$

In the first term on the right substitute for $W_h S_h$ in terms of $n_h'$ from (5A.1).
This gives the interesting result

$$V(\bar{y}_{st}) - V_{min}(\bar{y}_{st}) = \frac{(\sum W_h S_h)^2}{n^2}\left(\sum \frac{(n_h')^2}{\hat{n}_h} - n\right) = \frac{(\sum W_h S_h)^2}{n^2}\sum \frac{(\hat{n}_h - n_h')^2}{\hat{n}_h} \tag{5A.4}$$

Reverting to equation 5A.2, if the fpc (last term on the right) is negligible,
we see that

$$\frac{V_{min}(\bar{y}_{st})}{n} = \frac{(\sum W_h S_h)^2}{n^2}$$

Hence the proportional increase in variance resulting from deviations from
the optimum allocation is

$$\frac{V(\bar{y}_{st}) - V_{min}(\bar{y}_{st})}{V_{min}(\bar{y}_{st})} = \frac{1}{n}\sum \frac{(\hat{n}_h - n_h')^2}{\hat{n}_h} \tag{5A.5}$$

where $\hat{n}_h$ is the actual and $n_h'$ the optimum sample size in stratum h. If the
fpc is not negligible, the = sign in (5A.5) becomes >.

It is difficult to visualize the practical implications of this result
without working out numerical examples. One general consequence, though
somewhat conservative, is helpful. Let g be the greatest of the values
$|\hat{n}_h - n_h'|/\hat{n}_h$ found in any of the strata. Then from (5A.5), since $\sum \hat{n}_h = n$,

$$\frac{V - V_{min}}{V_{min}} = \frac{1}{n}\sum \hat{n}_h\left(\frac{\hat{n}_h - n_h'}{\hat{n}_h}\right)^2 \le \frac{1}{n}\sum \hat{n}_h g^2 = g^2$$

For instance, if the maximum deviation $|\hat{n}_h - n_h'|$, expressed as a fraction
of $\hat{n}_h$, is 0.2, or 20%, the proportional increase in variance cannot exceed
$(0.2)^2 = 0.04$, or 4%. If g = 30%, the proportional increase is at most
9%. In this sense the optimum can be described as flat.

This rough rule usually overestimates the actual increase in variance by a
substantial amount. Table 5A.1 gives an example with three strata for
n = 340. Optimum allocation requires sample sizes of 200, 100, and 40,
whereas the sizes actually used are 150, 120, and 70.

TABLE 5A.1
EFFECTS OF DEVIATIONS FROM OPTIMUM ALLOCATION

Stratum   n_h' (opt)   \hat{n}_h (act)   |\hat{n}_h - n_h'|/\hat{n}_h   (\hat{n}_h - n_h')^2/\hat{n}_h
   1         200            150                    0.33                          16.7
   2         100            120                    0.17                           3.3
   3          40             70                    0.43                          12.9
Total        340            340                                                  32.9

Since the value of g is 0.43 (stratum 3), the rough rule gives 18% as the
proportional increase in variance. From the column on the right, the
actual increase is seen to be 32.9/340 = 9.7%.
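
The comparison between the rough rule and the actual increase can be reproduced with a brief Python sketch. It uses the optimum and actual sample sizes of Table 5A.1 and formula (5A.5) with the fpc ignored; the variable names are illustrative.

```python
n_opt = [200, 100, 40]    # optimum sample sizes n_h'
n_act = [150, 120, 70]    # sample sizes actually used
n = sum(n_act)            # 340

# Right-hand column of Table 5A.1: (actual - optimum)^2 / actual.
terms = [(a - o) ** 2 / a for o, a in zip(n_opt, n_act)]

# Proportional increase in variance from (5A.5).
increase = sum(terms) / n

# Rough rule: g is the largest relative deviation, and the bound is g^2.
g = max(abs(a - o) / a for o, a in zip(n_opt, n_act))
bound = g ** 2
```

The bound of about 18% is nearly twice the actual increase of about 9.7%, which illustrates how conservative the rough rule can be.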

Evans (1951) examined the same question in terms of the effects of errors
in the estimated $S_h$ and developed an approximate rule showing whether an
estimated optimum is likely to be more precise than proportional
allocation. He supposes that the coefficient of variation of the estimated $S_h$ is
the same in all strata. This assumption is appropriate when the $S_h$ have
been estimated from a preliminary sample of the same size in each stratum.
He shows how to compute the size of a preliminary sample needed to make
an "optimum" allocation better, on the average, than proportional
allocation. Previously, Sukhatme (1935) showed that a small initial sample
usually gives a high probability that "optimum" allocation will be superior
to simple random sampling.


5A.2 EFFECTS OF ERRORS IN THE STRATUM SIZES


For a desirable type of stratification, the stratum totals $N_h$ may not be
known exactly, being derived from census data that are out of date.
Instead of the true stratum proportions $W_h$, we have estimates $w_h$. The
sample estimate of $\bar{Y}$ is $\sum w_h \bar{y}_h$.

In general terms, the consequences of using weights that are in error are 
as follows: 


1. The sample estimate is biased. Because of the bias, we measure the
precision of the estimate by its mean square error about $\bar{Y}$ rather than by
its variance about its own mean (see section 1.8).

2. The bias remains constant as the sample size increases. Consequently,
a size of sample is always reached for which the estimate is less precise than
simple random sampling, and all the gain in precision from stratification
is lost.

3. The usual estimate $s(\bar{y}_{st})$ underestimates the true error of $\bar{y}_{st}$, since it
does not contain the contribution of the bias to the error.

To justify these statements, note that in repeated sampling the mean
value of the estimate is $\sum w_h \bar{Y}_h$. The bias therefore amounts to

$$\sum (w_h - W_h)\bar{Y}_h$$

It is independent of the size of the sample. In finding the mean square
error (MSE) of the estimate, it is easy to verify that the variance term is
given by the usual formula, with $w_h$ in place of $W_h$. Hence

$$MSE(\bar{y}_{st}) = \sum \frac{w_h^2 S_h^2}{n_h}(1 - f_h) + \left[\sum (w_h - W_h)\bar{Y}_h\right]^2 \tag{5A.6}$$

This expression was given by Stephan (1941). Finally, the usual formula
for $s^2(\bar{y}_{st})$ is clearly an unbiased estimate of the first term in (5A.6) but
takes no account of the second term.


Example. This illustrates the loss of precision from incorrect weights when
stratification is (a) slightly effective, (b) highly effective. Consider a large
population with $S^2 = 1$, divisible into two strata with $W_1 = 0.9$, $W_2 = 0.1$.
We shall assume $S_1 = S_2 = S_h$. Then, neglecting terms in $1/N_h$,

$$S^2 = \sum W_h S_h^2 + \sum W_h(\bar{Y}_h - \bar{Y})^2 \tag{5A.7}$$

$$\phantom{S^2} = S_h^2 + W_1 W_2(\bar{Y}_1 - \bar{Y}_2)^2$$

that is, $1 = S_h^2 + 0.09(\bar{Y}_1 - \bar{Y}_2)^2$.

In (a) take $\bar{Y}_1 - \bar{Y}_2 = 1$. Then $S_h^2 = 0.91$, and proportional stratification
reduces the variance by 9%, compared with simple random sampling.

In (b) take $\bar{Y}_1 - \bar{Y}_2 = 3$, giving $S_h^2 = 0.19$, a reduction in variance of more
than 80%.

With two strata, the bias may be written

$$(w_1 - W_1)(\bar{Y}_1 - \bar{Y}_2)$$

since $(w_1 - W_1) = -(w_2 - W_2)$. Suppose that the estimated weights are
$w_1 = 0.92$ and $w_2 = 0.08$. The bias amounts to (0.02)(1) = 0.02 in (a) and to
0.06 in (b). Hence we have the following comparable variances for a sample of
size n:

Simple random sampling: $V(\bar{y}) = 1/n$
Stratified random sampling:
(a) $MSE(\bar{y}_{st}) = 0.91/n + 0.0004$
(b) $MSE(\bar{y}_{st}) = 0.19/n + 0.0036$

TABLE 5A.2
COMPARABLE VALUES OF $V(\bar{y})$

           Simple       Stratified Random
   n       Random        (a)         (b)
   50      0.0200      0.0186      0.0074
  100      0.0100      0.0095      0.0055
  200      0.0050      0.0050      0.0046
  300      0.0033      0.0034      0.0042
  400      0.0025      0.0027      0.0041
 1000      0.0010      0.0013      0.0038

In (a), simple random sampling begins to win at n = 300. There is
little to choose between the two methods, however, up to n = 1000.

In (b), with more at stake, stratification is superior up to n = 200, although
most of the potential gain has already been lost at this sample size. Beyond
n = 300 stratification becomes markedly inferior to simple random sampling.
Accurate estimation of the $W_h$ is particularly important when stratification is
highly effective or when the sample size is large.
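
The comparison in this example can be explored with a small Python sketch. It takes the within-stratum variances (0.91 and 0.19) and the biases (0.02 and 0.06) from the example and evaluates the mean square error of the stratified estimate as within-stratum variance over n plus squared bias. The function names are hypothetical, and the break-even calculation is an added illustration rather than part of the original example.

```python
def mse_stratified(n, S2_within, bias):
    """MSE of the stratified mean: variance term plus squared bias."""
    return S2_within / n + bias ** 2


def crossover_n(S2_within, bias, S2_total=1.0):
    """Sample size at which the stratified MSE equals S2_total / n."""
    return (S2_total - S2_within) / bias ** 2


cases = {"a": (0.91, 0.02), "b": (0.19, 0.06)}
sizes = [50, 100, 200, 300, 400, 1000]

srs = [1.0 / n for n in sizes]
table = {k: [mse_stratified(n, S2, b) for n in sizes]
         for k, (S2, b) in cases.items()}
n_star = {k: crossover_n(S2, b) for k, (S2, b) in cases.items()}
```

Under the assumptions of the example the break-even sample size works out to about 225 in both cases, which agrees with the statement that stratification is still ahead at n = 200 and behind at n = 300.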


In some surveys a large preliminary sample of size n' can be taken in
order to estimate the $W_h$. This technique, known as double sampling or
two-phase sampling, has numerous applications and is discussed in Chapter
12. It will be shown that with double sampling the mean square error of
$\bar{y}_{st}$ is approximately

$$\frac{\sum W_h S_h^2}{n} + \frac{\sum W_h(\bar{Y}_h - \bar{Y})^2}{n'}$$

By comparing this MSE with $S^2/n$, as given by equation 5A.7, we see that
most of the gain from stratification is retained provided that n' is much
greater than n. To put it more generally, a set of estimated weights
preserves most of the potential gain from stratification if the weights are much
more accurately estimated than they would be from a simple random
sample of size n.


5A.3 THE PROBLEM OF ALLOCATION WITH MORE
THAN ONE ITEM 


Since the best allocation for one item will not in general be best for 
another, some compromise must be reached in a survey with numerous 
items. The first step is to reduce the items considered in the allocation to a 
relatively small number thought to be most important. If good previous 
data are available, we can then compute the optimum allocation for each 
item separately and see to what extent there is disagreement. In a survey
of a specialized type the correlations among the items may be high and the
allocations may differ relatively little.


Example. Data given by Jessen (1942) illustrate a farm survey of this kind. 
The state of Iowa was divided into five geographic regions, each denoted by its 
major agricultural enterprise. Suppose that these regions are to be used as 
Strata in a survey on dairy farming. The three items of most interest are the 
number of cows milked per day, the number of gallons of miik per day, and the 
total annual cash receipts from dairy products. From a survey made in 1938, the 
estimated standard deviations 5, within strata are shown in Table 5A.3. In 


Table 5A.4 the optimum Neyman allocations based on these s; are given for the 
individual items in a sample of 1000 farms. 
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TABLE 5A.3 
STANDARD DEVIATIONS WITHIN STRATA 


Sh 
Receipts 
Sh Sh for Dairy 
ah = Ny Cows Gallons Products 
Stratum ^^'N Milked of Milk ($) 
Northeast dairy 0.197 4.6 11.7 332 
Cash grain 0.191 3.4 9.8 357 
Western livestock 0.219 3.3 7.0 246 
Southern pasture 0.184 2.8 6.5 173 
Eastern livestock 0.208 3.7 9.8 279 
TABLE 5A.4 
SAMPLE SIZES WITHIN STRATA (n = 1000) 
Allocation 
Optimum for 
Average 
Stratum Proportional Cows Gallons Receipts my, 
Saigo esca EARUM" eee 
Northeast dairy 197 254 258 236 250 
Cash grain 191 182 209 246 212 
Western livestock 219 203 171 194 189 
Southern pasture 184 145 134 115 131 
Eastern livestock 208 216 228 209 218 
TABLE 5A.5 


ExPECTED VARIANCES OF THE ESTIMATED MEAN 


Type of allocation Cows Gallons Receipts 
Optimum 0.0127 0.0800 76.9 
Compromise 0.0128 0.0802 71.6 
Proportional 0.0131 0.0837 80.9 


The individual optimum allocations differ only moderately from each other.
With one exception, all three deviate in the same direction from a proportional
allocation. Thus, in the first stratum, proportional allocation suggests 197 farms,
and the individual allocations lead to numbers between 236 and 258. The
average of the optimum sample sizes for the three items, shown in the right-hand
column, provides a satisfactory compromise allocation.

Table 5A.5 shows the expected sampling variances of \bar{y}_{st}, as given by the
individual optima, the compromise, and the proportional allocations. The
formulas are as follows:

    V_{opt} = (\sum W_h s_h)^2 / n,    V_{comp} = \sum W_h^2 s_h^2 / n_h,    V_{prop} = \sum W_h s_h^2 / n

The compromise allocation gives results almost as precise as if it were possible
to use separate optimum allocations for each item. What is more noteworthy is
that proportional allocation is only slightly less precise than the compromise or
the individual optima. Further, Table 5A.5 overestimates the precision of the
optima and of the compromise, since these allocations were made from estimated
variances. This result is another illustration of the flatness of the optimum
mentioned in section 5A.1.
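
As a rough check on Tables 5A.4 and 5A.5, the following Python sketch recomputes the Neyman allocations and the three sets of variances from the entries of Table 5A.3. It assumes the fpc is ignored and uses unrounded allocations, so the last digit can differ slightly from the printed tables; the function and variable names are illustrative only.

```python
# Table 5A.3: stratum weights W_h and within-stratum standard deviations s_h.
W = [0.197, 0.191, 0.219, 0.184, 0.208]
s = {
    "cows":     [4.6, 3.4, 3.3, 2.8, 3.7],
    "gallons":  [11.7, 9.8, 7.0, 6.5, 9.8],
    "receipts": [332, 357, 246, 173, 279],
}
n = 1000

def neyman(W, s_h, n):
    """Neyman allocation n_h = n * W_h s_h / sum(W_h s_h), left unrounded."""
    total = sum(w * sd for w, sd in zip(W, s_h))
    return [n * w * sd / total for w, sd in zip(W, s_h)]

def variance(W, s_h, n_h):
    """V = sum(W_h^2 s_h^2 / n_h), with the fpc ignored."""
    return sum((w * sd) ** 2 / m for w, sd, m in zip(W, s_h, n_h))

optimum = {item: neyman(W, s_h, n) for item, s_h in s.items()}
# Compromise: stratum-by-stratum average of the three optimum allocations.
compromise = [sum(optimum[item][h] for item in s) / len(s) for h in range(len(W))]
proportional = [n * w for w in W]

for item, s_h in s.items():
    print(item,
          [round(x) for x in optimum[item]],
          round(variance(W, s_h, optimum[item]), 4),
          round(variance(W, s_h, compromise), 4),
          round(variance(W, s_h, proportional), 4))
```

The three variances printed for each item come out close together, which is the flatness of the optimum that the example illustrates.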


5A.4 OTHER METHODS WITH MORE THAN ONE ITEM 


In some surveys the optimum allocations for individual variates differ 
so much that there is no obvious compromise. Some principle is needed 
that will determine the allocation to be used, although none seems best for 
all applications. Two useful ones suggested by Yates (1960) are presented. 

The first applies to surveys taken for a specialized objective, in which the
loss due to an error of a given size in an estimate can be measured in terms
of money or utility, as discussed in section 4.9. With v variates and
quadratic loss functions, it may be reasonable to express the total expected
loss as a linear function

    L = a_1 V_1 + a_2 V_2 + \cdots + a_v V_v                              (5A.8)

where the a's are known numbers, and V_j = V(\bar{y}_{j,st}) for the jth variate.
With a linear function for the costs of sampling, we have

    C = c_0 + \sum_h c_h n_h                                              (5A.9)

The n_h are determined to minimize (C + L). By ordinary calculus methods
we find

    n_h = \frac{W_h}{\sqrt{c_h}} \sqrt{\sum_{j=1}^{v} a_j S_{jh}^2}          (5A.10)

where S_{jh}^2 is the variance of the jth variate in stratum h.
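
A minimal Python sketch of the allocation in (5A.10) follows. The input values (two strata, two variates, loss coefficients a_j, unit costs c_h) are hypothetical and chosen only for illustration; the helper that evaluates C + L takes c_0 = 0 and ignores the fpc, since neither affects the minimizing n_h.

```python
import math

def allocation_5a10(W, S2, a, c):
    """n_h = (W_h / sqrt(c_h)) * sqrt(sum_j a_j * S2[j][h]), as in (5A.10).

    W  : stratum weights W_h
    S2 : S2[j][h] is the variance of variate j in stratum h
    a  : loss coefficients a_j
    c  : unit sampling costs c_h
    """
    return [
        W[h] / math.sqrt(c[h])
        * math.sqrt(sum(a[j] * S2[j][h] for j in range(len(a))))
        for h in range(len(W))
    ]

def cost_plus_loss(n_h, W, S2, a, c):
    """C + L with c_0 = 0 and the fpc ignored (both are constant in n_h)."""
    cost = sum(ch * nh for ch, nh in zip(c, n_h))
    loss = sum(
        a[j] * sum(W[h] ** 2 * S2[j][h] / n_h[h] for h in range(len(W)))
        for j in range(len(a))
    )
    return cost + loss

# Hypothetical inputs, for illustration only.
W = [0.8, 0.2]
S2 = [[16.0, 16.0], [4.0, 36.0]]
a = [100.0, 50.0]
c = [1.0, 1.0]

n_opt = allocation_5a10(W, S2, a, c)
print(n_opt, cost_plus_loss(n_opt, W, S2, a, c))
```

Because C + L is a sum of convex terms in each n_h, moving any n_h away from the value returned above can only increase the total.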


In the second approach we specify the desired standard error or variance
V_j (j = 1, 2, ..., v) for each variate. If population means are being
estimated, this implies that

    \sum_h \frac{W_h^2 S_{jh}^2}{n_h} - \sum_h \frac{W_h S_{jh}^2}{N} \le V_j    (j = 1, 2, \ldots, v)    (5A.11)

Inequality signs are used because the most economical allocation may
supply variances smaller than the desired V_j for some of the items.
In this approach the cost (equation 5A.9) is minimized subject to the
tolerances V_j.

The first step is to work out the optimum allocation for each variate
separately and to find the cost of satisfying its tolerance. Take the variate,
say y_1, for which the cost C_1 is highest and examine whether the optimum
allocation for y_1 satisfies all the other (v - 1) tolerances. If so, we use this
allocation and the problem is solved, because no other allocation will
satisfy the tolerance V_1 for y_1 at a cost as low as C_1.

If some of the tolerances are not met, the problem is more difficult. 
Dalenius (1957) gives an ingenious graphical solution usable when there 
are only two strata, and Yates (1960) gives a more general mathematical 
approach. These methods are illustrated by the following examples. 


Example 1 (Two Strata, Three Variates). The W_h, S_{jh} appear in columns 1
to 4 of Table 5A.6. It is assumed that the fpc is negligible and that c_h = constant.


TABLE 5A.6
ARTIFICIAL DATA FOR TWO STRATA, THREE VARIATES

Column     (1)    (2)      (3)      (4)        (5)           (6)           (7)           (8)          (9)
Stratum    W_h   S_{1h}   S_{2h}   S_{3h}   W_h S_{1h}   W_h S_{2h}   W_h S_{3h}   (6)^2/(5)   (7)^2/(5)
   1       0.8     4        2        1         3.2           1.6           0.8          0.8          0.2
   2       0.2     4        6        8         0.8           1.2           1.6          1.8          3.2
Totals                                         4.0           2.8           2.4          2.6          3.4


The extra computation required when these assumptions do not hold is minor.
Under optimum allocation for the jth variate,

    V(\bar{y}_{j,st}) = (\sum_h W_h S_{jh})^2 / n

Columns 5 through 7 give the material for computing the individual optimum
variances.


Case 1. This illustrates a situation with an easy solution. Suppose that the
desired s.e. for each estimate is 0.1, so that each V_j = 0.01. From columns 5
through 7 it is clear that the first variate requires the largest sample: (4.0)^2/0.01
= 1600. From column 5, its allocation is {}_1n_1 = 1280, {}_1n_2 = 320, where the
first subscript signifies that the allocation is the optimum for y_1.

We now determine whether this solution supplies the two other desired
tolerances. A useful general result for this purpose is as follows. If the optimum
allocation for the jth variate is used, the variance obtained for the kth variate is

    V(\bar{y}_{k,st}) = \sum_h \frac{W_h^2 S_{kh}^2}{{}_jn_h} = \frac{1}{n} (\sum_h W_h S_{jh}) (\sum_h \frac{W_h S_{kh}^2}{S_{jh}})      (5A.12)

since {}_jn_h = n W_h S_{jh} / \sum W_h S_{jh}. We require this result for j = 1, k = 2, 3. In
columns 8 and 9 the terms inside the second parentheses are computed from columns 5
through 7. These results give

    V(\bar{y}_{2,st}) = (4.0)(2.6)/1600 = 0.0065,    V(\bar{y}_{3,st}) = (4.0)(3.4)/1600 = 0.0085

Both tolerances are more than satisfied.

Case 2. With the same data, the desired s.e.'s are 0.1 for y_1 and y_2, but 0.08
for y_3, so that V_3 = 0.0064. The foregoing solution no longer holds, since it
gives V_3 = 0.0085. The graphical method of Dalenius may be used. Equation
5A.11 shows that for any specified variate the values of n_1, n_2 which satisfy the
tolerance exactly lie on a hyperbola in n_1 and n_2. Figure 5A.1 shows the three
hyperbolas for this problem. With y_1, for instance, the equation of the hyperbola
is

    \frac{(3.2)^2}{n_1} + \frac{(0.8)^2}{n_2} = \frac{10.24}{n_1} + \frac{0.64}{n_2} = 0.01

The region in which all these requirements are met is the area above and to the
right of the dotted lines AB, BC. We seek the point in this region at which
n_1 + n_2 is a minimum. This is clearly the point B. Hence the graph gives the
solution n_1 = 1200, n_2 = 430, n = 1630. This solution can be found arithmetically,
since it is the point at which both variates y_1 and y_3 meet their tolerances
exactly. The reader may verify that the arithmetical method gives n_1 = 1200,
n_2 = 436.
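
The arithmetic in both cases can be checked with a short Python sketch based on the entries of Table 5A.6. It assumes, as in the example, a negligible fpc and equal unit costs. For Case 2 the substitution a = 1/n_1, b = 1/n_2 turns the two binding hyperbolas (for y_1 and y_3) into straight lines, which are solved by Cramer's rule; this gives n_1 = 1200 and n_2 of about 436. The variable names are illustrative.

```python
# Table 5A.6: two strata, three variates; S[j][h] is S_jh.
W = [0.8, 0.2]
S = [[4, 4], [2, 6], [1, 8]]

def var_under(n_h, j):
    """V(ybar_st) for variate j when the stratum sample sizes are n_h."""
    return sum((W[h] * S[j][h]) ** 2 / n_h[h] for h in range(2))

# Case 1: optimum (Neyman) allocation for y_1 with n = 1600.
total1 = sum(W[h] * S[0][h] for h in range(2))
alloc1 = [1600 * W[h] * S[0][h] / total1 for h in range(2)]
v2_case1 = var_under(alloc1, 1)
v3_case1 = var_under(alloc1, 2)

# Case 2: point B, where the tolerances for y_1 (0.01) and y_3 (0.0064)
# hold exactly. With a = 1/n_1 and b = 1/n_2 the equations are linear:
#   a11*a + a12*b = v1,   a31*a + a32*b = v3
a11, a12, v1 = (W[0] * S[0][0]) ** 2, (W[1] * S[0][1]) ** 2, 0.01
a31, a32, v3 = (W[0] * S[2][0]) ** 2, (W[1] * S[2][1]) ** 2, 0.0064
det = a11 * a32 - a12 * a31
a = (v1 * a32 - a12 * v3) / det
b = (a11 * v3 - a31 * v1) / det
n1, n2 = 1 / a, 1 / b

print(alloc1, v2_case1, v3_case1)
print(n1, n2, var_under([n1, n2], 1))
```

The last printed value is the variance for y_2 at point B; it falls below 0.01, so the tolerance that is not binding is still met.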


Fig. 5A.1. Graphical solution of allocation problem (three variables, two strata). Horizontal axis: n_1; vertical axis: n_2.




With two strata, the graphical method works for any number of variates. The
situation with more than two strata, which is more complex, is illustrated by
example 2.


Example 2 (Four Strata, Two Variates). The data are shown in columns 1
to 3 of Table 5A.7. The problem is to find the smallest sample size for
which

    V_1 \le 0.04,    V_2 \le 0.01


TABLE 5A.7
ARTIFICIAL DATA FOR FOUR STRATA, TWO VARIATES

Column     (1)    (2)      (3)       (4)           (5)          (6)          (7)
Stratum    W_h   S_{1h}   S_{2h}   W_h S_{1h}   W_h S_{2h}   (5)^2/(4)   (4)^2/(5)
   1       0.4     5        1         2.0           0.4         0.08        10.00
   2       0.3     5        2         1.5           0.6         0.24         3.75
   3       0.2     5        4         1.0           0.8         0.64         1.25
   4       0.1     5        8         0.5           0.8         1.28         0.31
Totals                                5.0           2.6         2.24        15.31


As before, we first work out the optimum allocation and resulting sample size
for each variate. From columns 4 and 5,

    V(\bar{y}_{1,st}) = 25/n,      {}_1n = 25/0.04 = 625
    V(\bar{y}_{2,st}) = 6.76/n,    {}_2n = 6.76/0.01 = 676

From equation 5A.12 and column 7, the variance obtained for y_1 if allocation 2
is used with n = 676 is

    V(\bar{y}_{1,st}) = (2.6)(15.31)/676 = 39.81/676 = 0.0589

This is larger than the value 0.04 specified for V_1.
We must seek a compromise allocation that satisfies both tolerances exactly.
Using Lagrange multipliers \lambda_1 and \lambda_2, we find the values of n_h that minimize

    \sum_{h=1}^{L} c_h n_h + \lambda_1 \sum_{h=1}^{L} \frac{W_h^2 S_{1h}^2}{n_h} + \lambda_2 \sum_{h=1}^{L} \frac{W_h^2 S_{2h}^2}{n_h}

In this example c_h = 1, but the general method is given. Differentiation with
respect to n_h leads to

    n_h^2 = \frac{\lambda_1 W_h^2 S_{1h}^2 + \lambda_2 W_h^2 S_{2h}^2}{c_h},

    \frac{n_h}{n} = \frac{W_h \sqrt{(\lambda_1 S_{1h}^2 + \lambda_2 S_{2h}^2)/c_h}}{\sum_h W_h \sqrt{(\lambda_1 S_{1h}^2 + \lambda_2 S_{2h}^2)/c_h}}      (5A.13)

To obtain a solution, the values of \lambda_1, \lambda_2, and n that satisfy the two variance
conditions and the cost condition must be found. Since there is no simple
explicit solution, some method of successive approximation must be selected.
An approach is used in which the n_h/n are determined first.

From (5A.13), the optimum n_h is a kind of weighted combination of S_{1h}^2
and S_{2h}^2. Clearly, \lambda_1 and \lambda_2 enter (5A.13) only in their ratio \lambda_1/\lambda_2. Since S_{1h}^2
and S_{2h}^2 may have widely different values, it is hard to guess a good first
approximation to \lambda_1/\lambda_2. A change of scale gives a better initial approximation.

If \lambda_1/\lambda_2 tends to infinity, n_h in (5A.13) becomes {}_1n_h, the optimum under
allocation 1. Similarly, if \lambda_1/\lambda_2 = 0, n_h = {}_2n_h. It follows that with the correct
value of \lambda = \lambda_1/(\lambda_1 + \lambda_2), (5A.13) is equivalent to

    \frac{n_h}{n} = \frac{\sqrt{\lambda ({}_1n_h/{}_1n)^2 + (1 - \lambda)({}_2n_h/{}_2n)^2}}{\sum_h \sqrt{\lambda ({}_1n_h/{}_1n)^2 + (1 - \lambda)({}_2n_h/{}_2n)^2}}      (5A.14)

For any value of \lambda, write

    V(\bar{y}_{1,st}) = \phi_1(\lambda)/n,    V(\bar{y}_{2,st}) = \phi_2(\lambda)/n

We want to find \lambda and n such that

    \phi_1(\lambda)/n = V_1 = 0.04,    \phi_2(\lambda)/n = V_2 = 0.01      (5A.15)

From the initial calculations, we know that when \lambda = 1, \phi_1(\lambda) = 25, this being
its minimum value, and that when \lambda = 0, \phi_1(\lambda) = 39.81. As an approximation,
assume that \phi_1(\lambda) is a parabola in \lambda, with its vertex at \lambda = 1. This gives

    \frac{\phi_1(\lambda)}{n} = \frac{25 + 14.81(1 - \lambda)^2}{n} = 0.04      (5A.16)

For \phi_2(\lambda), the minimum is 6.76, at \lambda = 0. Its value at \lambda = 1 is found by
computing V(\bar{y}_{2,st}) under allocation 1. From columns 4 and 6 of Table 5A.7,
we obtain \phi_2(1) = (5.0)(2.24) = 11.20. The parabolic approximation gives

    \frac{\phi_2(\lambda)}{n} = \frac{6.76 + 4.44\lambda^2}{n} = 0.01      (5A.17)

Equations 5A.16 and 5A.17 are easily solved to give \lambda = 0.41, n = 751 as first
approximations.

With this value of \lambda, we compute the allocation and the sample size needed
to satisfy each tolerance; these two sample sizes should be almost
equal (and, we expect, not far from n = 751).
Equation 5A.14 gives the n_h. If

    r_h = \sqrt{\lambda ({}_1n_h/{}_1n)^2 + (1 - \lambda)({}_2n_h/{}_2n)^2}

then (5A.14) may be expressed in the form

    \frac{n_h}{n} = \frac{r_h}{\sum_h r_h}

This form is convenient because from the Neyman allocation rule

    \frac{{}_1n_h}{{}_1n} = \frac{W_h S_{1h}}{\sum W_h S_{1h}},    \frac{{}_2n_h}{{}_2n} = \frac{W_h S_{2h}}{\sum W_h S_{2h}}

and the quantities W_h S_{1h}, W_h S_{2h} have already been computed in columns 4 and
5 of Table 5A.7.


TABLE 5A.8
CHECK ON FIRST APPROXIMATION TO THE ALLOCATION

Column          (1)                  (2)             (3)              (4)                     (5)                        (6)               (7)
Stratum  ({}_1n_h/{}_1n)^2   ({}_2n_h/{}_2n)^2      r_h     n_h/n = r_h/\sum r_h   W_h^2 S_{1h}^2/(n_h/n)   W_h^2 S_{2h}^2/(n_h/n)     n_h
   1           0.16                0.0225           0.2808          0.2652                  15.08                      0.60              194
   2           0.09                0.0529           0.2610          0.2465                   9.13                      1.46              180
   3           0.04                0.0961           0.2704          0.2554                   3.92                      2.51              187
   4           0.01                0.0961           0.2466          0.2329                   1.07                      2.75              171
Totals                                              1.0588          1.0000                  29.20                      7.32              732

Table 5A.8 shows the rest of the calculations, column 4 giving the n_h/n. To
find the resulting variances for a sample of size n, we use the result

    V(\bar{y}_{j,st}) = \sum_h \frac{W_h^2 S_{jh}^2}{n_h} = \frac{1}{n} \sum_h \frac{W_h^2 S_{jh}^2}{n_h/n}

The quantities W_h^2 S_{jh}^2/(n_h/n) are given in columns 5 and 6. From the column
totals,

    V(\bar{y}_{1,st}) = 29.20/n = 0.04,    n = 730
    V(\bar{y}_{2,st}) = 7.32/n = 0.01,     n = 732

The two n's are so close that we accept this allocation and take n = 732. The
values of the n_h, shown in column 7, are found by multiplying column 4 by 732.

If the two values of n given by the first approximation differ materially, a
second approximation to \lambda and n is computed, either graphically or by the
parabolic functions, using the already computed values of \phi_1(\lambda) and \phi_2(\lambda). With
two variates, the same method applies to any number of strata.
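
The check in Table 5A.8 can be sketched in Python as follows. This is an illustration under the example's assumptions (c_h = 1, fpc ignored), not the procedure as the text carries it out: the allocation fractions are left unrounded, so the two sample sizes differ by a unit or two from the table, and the refinement of \lambda uses simple bisection in place of the parabolic approximation. All names are illustrative.

```python
import math

# Table 5A.7: four strata, two variates.
W = [0.4, 0.3, 0.2, 0.1]
S1 = [5, 5, 5, 5]
S2 = [1, 2, 4, 8]
V1, V2 = 0.04, 0.01

def required_sizes(lam):
    """Sample sizes needed to meet V1 and V2 under the allocation for lam."""
    t1 = sum(w * s for w, s in zip(W, S1))
    t2 = sum(w * s for w, s in zip(W, S2))
    # r_h combines the two Neyman allocation fractions.
    r = [math.sqrt(lam * (w * s1 / t1) ** 2 + (1 - lam) * (w * s2 / t2) ** 2)
         for w, s1, s2 in zip(W, S1, S2)]
    total_r = sum(r)
    frac = [x / total_r for x in r]                      # n_h / n
    phi1 = sum((w * s) ** 2 / f for w, s, f in zip(W, S1, frac))
    phi2 = sum((w * s) ** 2 / f for w, s, f in zip(W, S2, frac))
    return phi1 / V1, phi2 / V2, frac

# First approximation from the text.
n_for_v1, n_for_v2, frac = required_sizes(0.41)

# Bisection on lam until both tolerances call for the same n.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    need1, need2, _unused = required_sizes(mid)
    if need1 > need2:     # V1 is the binding tolerance: lean toward allocation 1
        lo = mid
    else:
        hi = mid
lam_star = (lo + hi) / 2
n_star = max(required_sizes(lam_star)[:2])

print(n_for_v1, n_for_v2, lam_star, n_star)
```

With these inputs the refined \lambda stays close to the first approximation, which is consistent with the text accepting the allocation after one step.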

With more than two strata and more than two variates, the best computing
method is not clear. Some results obtained in the mathematical study of
programming may be useful. More complicated problems can also arise; we might
wish to specify upper limits to the variances for certain subdivisions of the
population as well as over-all variances.


5A.5 TWO-WAY STRATIFICATION WITH SMALL
SAMPLES


Suppose that there are two criteria of stratification, say by R rows and
C columns, making RC cells. If n \ge RC, every cell can be represented in
the sample. A problem arises when n < RC, and we would like the sample
to give proportional representation to each criterion of stratification.


TABLE 5A.9
NUMBER AND PROPORTION OF SCHOOLS IN EACH CELL

                              Expenditure per Pupil
Size of
City                    A        B        C        D          Totals        n_{i.}
I       m_{1j}         15       21       17        9       m_{1.}    62
        P_{1j}       0.091    0.127    0.103    0.055      P_{1.}  0.376      4
II      m_{2j}         10        8       13        7       m_{2.}    38
        P_{2j}       0.061    0.049    0.079    0.042      P_{2.}  0.231      2
III     m_{3j}          6        9        5        8       m_{3.}    28
        P_{3j}       0.036    0.055    0.030    0.049      P_{3.}  0.170      2
IV      m_{4j}          4        3        6        6       m_{4.}    19
        P_{4j}       0.024    0.018    0.036    0.036      P_{4.}  0.114      1
V       m_{5j}          3        2        5        8       m_{5.}    18
        P_{5j}       0.018    0.012    0.030    0.049      P_{5.}  0.109      1

Totals  m_{.j}         38       43       46       38                165
        P_{.j}       0.230    0.261    0.278    0.231              1.000
        n_{.j}          2        3        3        2


In a simple method developed by Bryant, Hartley, and Jessen (1960) the 
technique requires only that n exceed the greater of R and C. 


To illustrate this method, suppose that a small population of 165 schools
has been stratified by size of city into five classes and by average
expenditure per pupil into four classes. The numbers of schools m_{ij} and the
proportions of schools P_{ij} = m_{ij}/165 in each of the 20 cells are shown in
Table 5A.9.

The objective is to give each school an approximately equal chance of
selection while giving each marginal class its proportional representation.
In this illustration n = 10. Compute the numbers n_{i.} = nP_{i.} and n_{.j} = nP_{.j},
where these products are rounded to the nearest integers (with a further
minor adjustment, if needed, so that the n_{i.} and the n_{.j} both add to n).
These numbers are shown in Table 5A.9.

The next step is to draw n = 10 cells with probability n_{i.}n_{.j}/n^2 for the
ijth cell. This is done by constructing an n x n square (Table 5A.10). In
row 1 one column is drawn at random. In row 2 one of the remaining
columns is drawn at random, and so on. At the end, each row and column
contains one unit. (This draw is most quickly made by a random
permutation of the numbers 1 to 10.) The results of one draw are indicated by
X's in Table 5A.10.


TABLE 5A.10
10 x 10 SQUARE FOR DRAWING THE SAMPLE

[Square with rows 1 to 10 and columns 1 to 10; an X marks the column drawn in each row.]


Note that columns 1 and 2 are assigned to marginal stratum A, since
n_{.1} = 2. Similarly, rows 1 through 4 are assigned to marginal stratum I,
since n_{1.} = 4, and so on. This completes the allocation of the sample to
the 20 cells. The allocation appears in more compact form in Table 5A.11.
Two schools are drawn at random from the 15 schools in cell IA, and so on.
The probability that a school in row i, column j is drawn is proportional to
n_{i.}n_{.j}/P_{ij}. Thus the probabilities are not equal, though they will be
approximately so if P_{ij} \doteq n_{i.}n_{.j}/n^2.
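
The drawing of cells can be sketched in Python as below. This is a minimal illustration of the n x n square with a random permutation, using the marginal sample sizes of Table 5A.9; the function name and the use of a seeded `random.Random` are choices made here for reproducibility, not part of the method as published.

```python
import random
from collections import Counter

def two_way_allocation(n_row, n_col, rng):
    """Draw one allocation of n units to the cells of a two-way table.

    n_row[i] and n_col[j] are the marginal sample sizes; both must sum to n.
    Returns a Counter mapping (i, j) to the number of units in that cell.
    """
    n = sum(n_row)
    if n != sum(n_col):
        raise ValueError("row and column sample sizes must have the same total")
    # Label each row and column of the n x n square with its marginal stratum.
    row_of = [i for i, k in enumerate(n_row) for _ in range(k)]
    col_of = [j for j, k in enumerate(n_col) for _ in range(k)]
    perm = list(range(n))
    rng.shuffle(perm)              # row r of the square receives column perm[r]
    return Counter((row_of[r], col_of[perm[r]]) for r in range(n))

# Marginal sample sizes n_i. (city classes I-V) and n_.j (classes A-D).
cells = two_way_allocation([4, 2, 2, 1, 1], [2, 3, 3, 2], random.Random(1))
for (i, j), k in sorted(cells.items()):
    print("row", i + 1, "column", "ABCD"[j], "units", k)
```

Whatever permutation is drawn, the row and column totals of the allocation reproduce the marginal sample sizes, since each row and each column of the square is used exactly once.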

An unbiased estimate of the mean per school is

    \bar{y}_U = \frac{1}{n} \sum_i \sum_j \frac{n^2 P_{ij}}{n_{i.} n_{.j}} y_{ij}

where y_{ij} is the sample total in the ijth cell. If, however, P_{ij} \doteq n_{i.}n_{.j}/n^2,
the sample mean \bar{y} is probably preferable, since its bias should be
negligible. A sample estimate of variance is available for both the unbiased and
biased estimates, provided that n is at least twice the greater of R and C
and that at least two units are drawn in every row and column.

If P_{ij} differs markedly from n_{i.}n_{.j}/n^2 in some cells, an extra step keeps
the probabilities of selection of schools more nearly constant. After
computing the n_{i.} and n_{.j}, examine the quantities D_{ij} = nP_{ij} - n_{i.}n_{.j}/n,
after rounding them to integers. If, in any cell, D_{ij} is a positive integer,
automatically assign D_{ij} units to this cell. Reduce n, the n_{i.}, and the n_{.j}
by the amounts required by this fixed allocation and carry out the
remaining allocation as before.


TABLE 5A.11
ALLOCATION OF THE SAMPLE TO THE 20 CELLS

[Cell allocation by city class (rows I to V) and expenditure class (columns A to D); row totals 4, 2, 2, 1, 1; column totals 2, 3, 3, 2; total 10; cell IA receives 2 units.]

An earlier technique for this problem, including the situation in which a 
substantial number of cells are empty, was named controlled selection by 
Goodman and Kish (1950). In their applications rows represent the princi- 
pal stratification, one unit being drawn from each row. They show how to 
find a limited number of acceptable allocations, each with its appropriate 
probability, such that cells are selected with probabilities P_{ij}.


5A.6 THE CONSTRUCTION OF STRATA 


This topic raises several questions. What is the best characteristic for 
the construction of strata? How should the boundaries between the strata 
be determined? How many strata should there be? For a single item 
or variable y the best characteristic is clearly the frequency distribution 
of y itself. The next best is presumably the frequency distribution of some 
other quantity highly correlated with y. Given the number of strata, the
equations for determining the best stratum boundaries under proportional
and Neyman allocation have been worked out by Dalenius (1957), and
quicker approximate methods by several workers. We shall consider
Neyman allocation, since it is usually superior to proportional allocation in
populations in which gains from stratification are greatest. It is assumed
at first that the strata are set up by using the value of y itself.

Let y_0, y_L be the smallest and largest values of y in the population. The
problem is to find intermediate stratum boundaries y_1, y_2, ..., y_{L-1} such
that

    V(\bar{y}_{st}) = \frac{1}{n} (\sum_{h=1}^{L} W_h S_h)^2 - \frac{1}{N} \sum_{h=1}^{L} W_h S_h^2      (5A.18)

is a minimum. If the fpc is ignored, it is sufficient to minimize \sum W_h S_h.
Since y_h appears in this sum only in the terms W_h S_h and W_{h+1} S_{h+1}, we
have

    \frac{\partial}{\partial y_h} (\sum W_h S_h) = \frac{\partial}{\partial y_h} (W_h S_h) + \frac{\partial}{\partial y_h} (W_{h+1} S_{h+1})

Now if f(y) is the frequency function of y,

    W_h = \int_{y_{h-1}}^{y_h} f(t) dt,    \frac{\partial W_h}{\partial y_h} = f(y_h)      (5A.19)

Further,

    W_h S_h^2 = \int_{y_{h-1}}^{y_h} t^2 f(t) dt - \frac{[\int_{y_{h-1}}^{y_h} t f(t) dt]^2}{\int_{y_{h-1}}^{y_h} f(t) dt}      (5A.20)

Differentiation of (5A.20) gives

    S_h^2 \frac{\partial W_h}{\partial y_h} + 2 W_h S_h \frac{\partial S_h}{\partial y_h} = y_h^2 f(y_h) - 2 y_h \mu_h f(y_h) + \mu_h^2 f(y_h)

where \mu_h is the mean of y in stratum h. Add S_h^2 \partial W_h/\partial y_h to the left side,
and the equal quantity S_h^2 f(y_h) to the right side. This gives, on dividing
by 2S_h,

    \frac{\partial (W_h S_h)}{\partial y_h} = S_h \frac{\partial W_h}{\partial y_h} + W_h \frac{\partial S_h}{\partial y_h} = \frac{1}{2} f(y_h) \frac{(y_h - \mu_h)^2 + S_h^2}{S_h}

Similarly we find

    \frac{\partial (W_{h+1} S_{h+1})}{\partial y_h} = -\frac{1}{2} f(y_h) \frac{(y_h - \mu_{h+1})^2 + S_{h+1}^2}{S_{h+1}}

Hence the calculus equations for y_h are

    \frac{(y_h - \mu_h)^2 + S_h^2}{S_h} = \frac{(y_h - \mu_{h+1})^2 + S_{h+1}^2}{S_{h+1}}    (h = 1, 2, \ldots, L - 1)    (5A.21)

Unfortunately, these equations are ill adapted to practical computation,
since both \mu_h and S_h depend on y_h. A quick approximate method, due to
Dalenius and Hodges (1959), is presented. Let


    Z(y) = \int_{y_0}^{y} \sqrt{f(t)} dt

If the strata are numerous and narrow, f(y) should be approximately
constant (rectangular) within a given stratum. Hence

    W_h = \int_{y_{h-1}}^{y_h} f(t) dt \doteq f_h (y_h - y_{h-1})

    S_h \doteq \frac{1}{\sqrt{12}} (y_h - y_{h-1})

    Z_h - Z_{h-1} = \int_{y_{h-1}}^{y_h} \sqrt{f(t)} dt \doteq \sqrt{f_h} (y_h - y_{h-1})

where f_h is the "constant" value of f(y) in stratum h. By substituting these
approximations, we find

    \sqrt{12} \sum_{h=1}^{L} W_h S_h \doteq \sum_{h=1}^{L} f_h (y_h - y_{h-1})^2 = \sum_{h=1}^{L} (Z_h - Z_{h-1})^2      (5A.22)

Since (Z_L - Z_0) is fixed, it is easy to verify that the sum on the right is
minimized by making (Z_h - Z_{h-1}) constant.

Given f(y), the rule is to form the cumulative of \sqrt{f(y)} and choose the
y_h so that they create equal intervals on the cum \sqrt{f(y)} scale. Table 5A.12
illustrates the use of the rule.


TABLE 5A.12
CALCULATION OF STRATUM BOUNDARIES BY THE CUM \sqrt{f(y)} RULE

Industrial Loans /
Total Loans (%)        f(y)      Cum \sqrt{f(y)}
     0-5               3464          58.9
     5-10              2516         109.1
    10-15              2157         155.5
    15-20              1581         195.3
    20-25              1142         229.1
    25-30               746         256.4
    30-35               512         279.0
    35-40               376         298.4
    40-45               265         314.7
    45-50               207         329.1

Example. The data show the frequency distribution of the percentage of
bank loans devoted to industrial loans in a population of 13,435 banks of the
United States (McEvoy, 1956). The distribution is skew, with its mode at the
lower end. In the cum \sqrt{f} column, 58.9 = \sqrt{3464}, 109.1 = \sqrt{3464} + \sqrt{2516},
and so on.

Suppose that we want five strata. Since the total of cum \sqrt{f} is 389.5, the
division points should be at 77.9, 155.8, 233.7, and 311.6 on this scale. The
nearest available points are as follows:

                                          Stratum
                            1         2          3          4           5
Boundaries                0-5%      5-15%     15-25%     25-45%     45-100%
Interval on cum \sqrt{f}   58.9      96.6       73.6       85.6        74.8

The first two intervals, 58.9 and 96.6, are rather unequal, but cannot be improved
upon without a finer subdivision of the original classes.
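
The cum \sqrt{f} rule can be sketched in Python as follows. The frequencies are the ten classes listed in Table 5A.12 (0-5% through 45-50%); the overall total of 389.5 is taken from the example rather than computed, because the frequencies of the classes above 50% are not listed here. The sketch therefore relies on the assumption, true in this example, that every division point falls within the listed classes. Function and variable names are illustrative.

```python
import math
from itertools import accumulate

# Upper class limits (%) and frequencies for the classes 0-5, ..., 45-50.
upper = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
freq = [3464, 2516, 2157, 1581, 1142, 746, 512, 376, 265, 207]

# Running total of sqrt(f), class by class.
cum = list(accumulate(math.sqrt(f) for f in freq))

TOTAL = 389.5   # cum sqrt(f) over the whole range 0-100%, as quoted in the example
L = 5           # number of strata wanted

def nearest_boundaries(upper, cum, total, L):
    """Class limits whose cum sqrt(f) lies closest to k * total / L, k = 1..L-1."""
    chosen = []
    for k in range(1, L):
        target = k * total / L
        i = min(range(len(cum)), key=lambda idx: abs(cum[idx] - target))
        chosen.append(upper[i])
    return chosen

print([round(c, 1) for c in cum])
print(nearest_boundaries(upper, cum, TOTAL, L))
```

Because the running totals here are not rounded at each step, a few of them differ from the printed column in the first decimal place; the boundaries chosen are unaffected.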


If the class intervals in the original distribution of y are of unequal length,
a slight change is needed. When the interval changes from one of length d
to one of length ud, the value of \sqrt{f} for the second interval is multiplied by
\sqrt{u} when forming cum \sqrt{f}.

Although the mathematics behind the rule is crude, the rule has worked
well in both theoretical and actual distributions [Cochran (1961)]. In
another rule that does well, Ekman (1959), the boundaries are constructed
so that W_h(y_h - y_{h-1}) is constant.

The approximate rule has an interesting consequence. From equation
5A.22, the rule is equivalent to making W_h S_h approximately constant, as
conjectured by Dalenius and Gurney (1951). But with W_h S_h constant,
Neyman allocation gives a constant sample size n_h = n/L in all strata.
Since the optimum is flat with respect to variations in the n_h (section 5A.1),
use of the cum \sqrt{f} rule, taking equal sample sizes in the resulting strata, is
highly efficient.

Thus far we have made the unrealistic assumption that stratification can
be based on the values of y itself. In practice, some other variable x is used
(perhaps the value of y at a recent census). Dalenius (1957) develops
equations for the boundaries of x that minimize \sum W_h S_{yh}, given a
knowledge of the regression of y on x. If this regression is nonlinear, these
boundaries may differ considerably from those that are optimum when x
itself is the variable to be measured. The equations indicate, however, that
if the regression of y on x is linear and the correlation between y and x is
high within all strata the two sets of boundaries should be nearly the same.

Let

    y = \alpha + \beta x + e

where E(e) = 0 for all x and e, x are uncorrelated. The variance of e
within stratum h is S_{eh}^2. Then the x-boundaries which make V(\bar{y}_{st}) a
minimum satisfy the equations [Dalenius (1957)]

    \frac{\beta^2 [(x_h - \mu_{xh})^2 + S_{xh}^2] + 2 S_{eh}^2}{\beta S_{xh} \sqrt{1 + S_{eh}^2/\beta^2 S_{xh}^2}} = \frac{\beta^2 [(x_h - \mu_{x,h+1})^2 + S_{x,h+1}^2] + 2 S_{e,h+1}^2}{\beta S_{x,h+1} \sqrt{1 + S_{e,h+1}^2/\beta^2 S_{x,h+1}^2}}

If S_{eh}^2/\beta^2 S_{xh}^2 is small for all h, these equations reduce to the form (5A.21)
that gives optimum boundaries for x. But S_{eh}^2/\beta^2 S_{xh}^2 = (1 - \rho_h^2)/\rho_h^2,
where \rho_h is the correlation between y and x within stratum h.

Although more investigation is needed, this result suggests that the
cum \sqrt{f} rule applied to x should give an efficient stratification for another
variable y that has a linear regression on x with high correlation. Some
numerical results by Cochran (1961) support this conjecture. Moreover,
if the \rho_h are only moderate, as will happen when the number of strata is
increased, failure to use the optimum x-boundaries should have a less
deleterious effect on y.

The preceding discussion is, of course, mainly relevant to the sampling of
institutions stratified by some measure of size. The results are also
applicable when the survey contains several variables of major interest, provided
that all are related more or less to the same measure of size. For instance,
suppose that some variables are roughly proportional to the measure of
size, others to its square root, and others are almost independent of size.
Selection of boundaries by the cum \sqrt{f} rule should be roughly optimum
for the first set of variables and fairly good for the second set (where the
gains from stratification are smaller in any event). The third set may
suffer some loss of precision from the use of unequal sampling fractions.

The situation is different when one set of variables is closely related to 
one measure of size, and another set is closely related to a second measure 
with a markedly different frequency distribution. The general approach 
given in section 5A.4 is applicable, but the best computational methods for 
obtaining the boundaries that meet the desired tolerances for the variances 
have not been worked out. 

In geographical stratification the problem is less amenable to a
mathematical approach, since there are so many different ways in which stratum
boundaries may be formed. The usual procedure is to select a few
variables that have high correlations with the principal items in the survey
and to use a combination of judgment and trial and error to construct
boundaries that are good for these selected variables. Since the gains in
precision are likely to be modest, it is not worthwhile
to expend a great deal of effort in improving boundaries. Bases of
stratification for economic items have been discussed by Stephan (1941) and
Hagood and Bernert (1945) and for farm items by King and McCarty
(1941).


5A.7 NUMBER OF STRATA 


The two questions relevant to a decision about the number of strata L 
are (a) at what rate does the variance decrease as L is increased? (b) How 
is the cost of the survey affected by an increase in L? 

As regards (a), suppose first that strata are constructed by the values of
y. To take the simplest case, let the distribution of y be rectangular in the
interval (a, a + d). Then S_y^2, before stratification, is d^2/12, so that with a
simple random sample of size n, V(\bar{y}) = d^2/12n. If L strata of equal size
are created, the variance within any stratum is S_{yh}^2 = d^2/12L^2. Hence, for
a stratified sample, with W_h = 1/L and n_h = n/L,

    V(\bar{y}_{st}) = \frac{1}{n} (\sum_{h=1}^{L} W_h S_{yh})^2 = \frac{1}{n} (\sum_{h=1}^{L} \frac{1}{L} \cdot \frac{d}{\sqrt{12} L})^2 = \frac{d^2}{12nL^2} = \frac{V(\bar{y})}{L^2}

Thus with a rectangular distribution the variance of \bar{y}_{st} decreases inversely
as the square of the number of strata. Rather remarkably, this relation
continues to hold, roughly, when actual skew distributions with finite
range are stratified with the optimum choice of boundaries for Neyman
allocation. In eight distributions of data of the type likely to occur in
practice, Cochran (1961) found that the average values of V(\bar{y}_{st})/V(\bar{y})
were 0.232, 0.098, and 0.053 for L = 2, 3, 4, as compared with 0.250, 0.111,
and 0.062 for the rectangular distribution.

These results, which suggest that multiplication of strata is profitable,
give a misleading picture of what happens when some other variable x is
used to construct the strata. If \phi(x) = E(y | x) is the regression of y on x,
we may write

    y = \phi(x) + e

where \phi and e are uncorrelated. Hence

    S_y^2 = S_\phi^2 + S_e^2

By the preceding results, creation of L optimal strata for x may reduce
S_\phi^2 to S_\phi^2/L^2 if \phi(x) is linear or at a smaller rate if \phi(x) is nonlinear. But
the term S_e^2 is not reduced by stratification on x. As L increases, a value is
reached sooner or later at which the term S_e^2 dominates. Further increases
in L will produce only a trivial proportional reduction in V(\bar{y}_{st}).

How quickly the point of diminishing returns is reached depends on a
number of factors, particularly the relative sizes of S_e^2 and S_\phi^2 and the
nature of \phi(x). Only a few examples from actual data are available in the
literature. To supplement them, a simple theoretical approach is used.
Suppose that the optimum choice of stratum boundaries by means of x,
with samples of equal size n/L in each stratum, reduces V(\bar{x}_{st}) at a rate
proportional to 1/L^2. Thus

    V(\bar{x}_{st}) = \frac{L}{n} \sum_{h=1}^{L} W_h^2 S_{xh}^2 = \frac{S_x^2}{nL^2}      (5A.23)

Suppose also that the regression of y on x is linear, that is,

    y = \alpha + \beta x + e

where S_e^2 is constant. Then,

    V(\bar{y}_{st}) = \frac{L}{n} \sum_{h=1}^{L} W_h^2 S_{yh}^2 = \frac{\beta^2 L}{n} \sum_{h=1}^{L} W_h^2 S_{xh}^2 + \frac{L S_e^2}{n} \sum_{h=1}^{L} W_h^2

For any set of L strata, \sum W_h^2 \ge 1/L. Using (5A.23), we have

    V(\bar{y}_{st}) \ge \frac{1}{n} (\frac{\beta^2 S_x^2}{L^2} + S_e^2) = \frac{S_y^2}{n} [\frac{\rho^2}{L^2} + (1 - \rho^2)]      (5A.24)

where \rho is the correlation between y and x in the unstratified population.

With this model, Table 5A.13 shows V(\bar{y}_{st})/V(\bar{y}) for \rho = 0.99, 0.95, 0.90,
and 0.85 and L = 2 to 6, assuming that relation (5A.24) is an equality.
The right-hand columns of the table give V(\bar{y}_{st})/V(\bar{y}) for three sets of actual
data, described under the table, in which x is the value of y at some earlier
time.

The results for the regression model indicate that unless \rho exceeds 0.95,
little reduction in variance is to be expected beyond L = 6. Data sets 2 and
3 support this conclusion, although some further increase in L might be
profitable with the college enrollment data (set 1).

To complete this analysis, we require a cost function that shows how the
cost depends on L. Dalenius (1957) suggests the relation C = LC_s + nC_n.
The cost ratio C_s/C_n will vary with the type of survey. An increase in the
number of strata involves extra work in planning and drawing the sample
and increases the number of weights used in computing the estimates,
unless they are self-weighting. In some surveys almost no change is required
in the organization of the field work; in others a separate field unit is set up
in each stratum. Whatever the form of the cost function, the results in
Table 5A.13 suggest that if an increase in L beyond 6 necessitates any
substantial decrease in n in order to keep the cost constant the increase
will seldom be profitable.
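
The model values of V(\bar{y}_{st})/V(\bar{y}) follow directly from (5A.24) taken as an equality, and a few lines of Python are enough to tabulate them for the four correlations and L = 2 to 6. The function name is illustrative.

```python
def variance_ratio(rho, L):
    """V(ybar_st)/V(ybar) = rho^2/L^2 + (1 - rho^2), from (5A.24) as an equality."""
    return rho ** 2 / L ** 2 + (1 - rho ** 2)

for rho in (0.99, 0.95, 0.90, 0.85):
    row = [round(variance_ratio(rho, L), 3) for L in range(2, 7)]
    # The limit as L grows without bound is 1 - rho^2.
    print(rho, row, "limit:", round(1 - rho ** 2, 3))
```

The limiting value 1 - \rho^2 makes the diminishing returns explicit: once \rho^2/L^2 is small compared with 1 - \rho^2, further strata change the ratio very little.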

TABLE 5A.13
V(\bar{y}_{st})/V(\bar{y}) AS A FUNCTION OF L FOR THE LINEAR REGRESSION MODEL AND FOR
SOME ACTUAL DATA

            Linear Regression Model               Data, Set
   L       \rho = 0.90     \rho = 0.85          1         2         3
   2          0.392           0.458           0.197     0.295     0.500
   3          0.280           0.358           0.108     0.178     0.375
   4          0.241           0.323           0.075     0.142     0.244
   5          0.222           0.306           0.065     0.105     0.241
   6          0.212           0.298                     0.104     0.212
 \infty       0.190

Set     Type of Data              x        y       Source
 1      College enrollments      1952     1958     Cochran (1961)
 2      City sizes               1940     1950     Cochran (1961)
 3      Family incomes           1929     1933     Dalenius and Gurney (1951)


The discussion in this section is confined to surveys in which only over-all
estimates are to be made. If estimates are wanted also for geographic
subdivisions of the population, the argument for a larger number of strata
is stronger.


5A.8 STRATIFICATION AFTER SELECTION OF THE
SAMPLE


With some variables that are suitable for stratification, the stratum to
which a unit belongs is not known until the data have been collected.
Personal characteristics such as age, sex, race, and educational level are
common examples. The stratum sizes N_h may be obtainable fairly
accurately from official statistics, but the units can be classified into the strata
only after the sample data are known.

One procedure is to take a simple random sample of size n and classify
the units. Instead of the sample mean \bar{y}, we use the estimate \bar{y}_w = \sum W_h \bar{y}_h,
where \bar{y}_h is the mean of the sample units that fall in stratum h, and W_h =
N_h/N. This method is almost as precise as proportional stratified sampling,
provided that (a) the sample is reasonably large, say > 20, in every stratum,
and (b) the effects of errors in the weights W_h can be ignored (see section
5A.2).

To show this, let m_h be the number of units in the sample that fall in
stratum h, where m_h will vary from sample to sample. For samples in
which the m_h are fixed,

    V(\bar{y}_w) = \sum_h \frac{W_h^2 S_h^2}{m_h} - \frac{1}{N} \sum_h W_h S_h^2

The average value of this quantity in repeated samples of size n must now
be calculated. This requires a little care, since one or more of the m_h could
be zero. If this happened, two or more strata would have to be combined
before making the estimate, and a less precise estimate would be produced.
With increasing n, the probability that any m_h is zero becomes so small that
the contribution to the variance from this source is negligible.

If the case in which m_h is zero is ignored, Stephan (1945) has shown that
to terms of order n^{-2}

    E(\frac{1}{m_h}) = \frac{1}{nW_h} + \frac{1 - W_h}{n^2 W_h^2}

Hence

    E[V(\bar{y}_w)] = \frac{1 - f}{n} \sum_h W_h S_h^2 + \frac{1}{n^2} \sum_h (1 - W_h) S_h^2

The first term is the value of V(\bar{y}_{st}) for proportional stratification. The
second represents the increase in variance that arises because the m_h do
not distribute themselves proportionally. But

    \frac{1}{n^2} \sum_h (1 - W_h) S_h^2 = \frac{L}{n^2} \bar{S}_h^2 - \frac{1}{n^2} \sum_h W_h S_h^2 = \frac{1}{n \bar{n}_h} \bar{S}_h^2 - \frac{1}{n^2} \sum_h W_h S_h^2

where \bar{S}_h^2 is the average of the S_h^2 and \bar{n}_h = n/L is the average number of
units per stratum. Thus, if the S_h^2 do not differ greatly, the "increase"
term is about 1/\bar{n}_h times the variance for proportional stratification. The
increase will be small if \bar{n}_h is reasonably large.

This method can also be applied to a sample that is already stratified by
another factor, for example, into five geographic regions, provided that the
W_h are known separately within each region.
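
The size of the increase can be illustrated with a small Python sketch of the two terms in the approximation to E[V(\bar{y}_w)], namely ((1 - f)/n) \sum W_h S_h^2 for proportional stratification and (1/n^2) \sum (1 - W_h) S_h^2 for the added variance. The inputs (four equal strata with a common variance, n = 200) are hypothetical and chosen only to make the ratio easy to read; the function name is likewise illustrative.

```python
def poststratified_variance(W, S2, n, N=None):
    """Approximate E[V(ybar_w)] to terms of order n^-2, split into two parts.

    Returns (proportional, increase): the variance under proportional
    stratification and the extra variance due to the random m_h.
    If N is None the fpc is ignored (f = 0).
    """
    f = 0.0 if N is None else n / N
    proportional = (1 - f) / n * sum(w * s2 for w, s2 in zip(W, S2))
    increase = sum((1 - w) * s2 for w, s2 in zip(W, S2)) / n ** 2
    return proportional, increase

# Hypothetical inputs: L = 4 equal strata, equal variances, n = 200.
W = [0.25, 0.25, 0.25, 0.25]
S2 = [10.0, 10.0, 10.0, 10.0]
proportional, increase = poststratified_variance(W, S2, n=200)
print(proportional, increase, increase / proportional)
```

With equal stratum variances and the fpc ignored, the ratio of the increase to the proportional variance reduces to (L - 1)/n, here 3/200, which is of the order 1/\bar{n}_h with \bar{n}_h = 50.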


5A.9 QUOTA SAMPLING

In another method that has been widely used in opinion and market
research surveys the $m_h$ required in each stratum is computed in advance so
that stratification is proportional. The enumerator is instructed to
continue sampling until the necessary "quota" has been obtained in each
stratum. The most common variables for stratification are geographic
area, age, sex, race, and some measure of economic level. If the enumerator
were to choose persons at random within the geographic areas and
assign each to his appropriate stratum, the method would be identical
with stratified random sampling. A considerable amount of field work
would be required to fill all quotas, however, since in the later stages most
of the persons approached would fall in quotas already filled.

To expedite the filling of quotas, some latitude is allowed to the enumera- 
tor regarding the persons or households to be included. The amount of 
latitude varies with the agency, but, in general, quota sampling may be 
described as stratified sampling with a more or less nonrandom selection 
of units within strata. For this reason, sampling-error formulas cannot be 
applied with confidence to the results of quota samples. A number of 
comparisons between the results of quota and probability samples are 
summarized by Stephan and McCarthy (1958), who give an excellent cri- 
tique of the performance of both types of survey. The quota method seems 
likely to produce samples that are biased on characteristics such as income,
education and occupation, although it often agrees well with the proba- 
bility samples on questions of opinion and attitude. 


5A.10 ESTIMATION FROM A SAMPLE OF THE GAIN 
DUE TO STRATIFICATION 


When a stratified random sample has been taken, it may be of interest, 
as a guide to the conduct of future surveys, to appraise the gain in precision 
relative to simple random sampling. 

The data available from the sample are the values of $N_h$, $n_h$, $\bar{y}_h$, and $s_h^2$.
From section 5.4, the estimated variance of the weighted mean from the
stratified sample is

$$v(\bar{y}_{st}) = \sum_h \frac{W_h^2 s_h^2}{n_h} - \sum_h \frac{W_h s_h^2}{N}$$

The problem is to compare this variance with an estimate of the variance
of the mean that would have been obtained from a simple random sample.
One procedure sometimes used calculates the familiar mean square
deviation from the sample mean,

$$s^2 = \frac{\sum_h \sum_i (y_{hi} - \bar{y})^2}{n - 1}$$

where the strata are ignored. This is taken as an estimate of the variance
per unit for a simple random sample. This method works well enough if the
allocation is proportional, since a simple random sample distributes itself
approximately proportionally among strata. But, if an allocation far from
proportional has been adopted, the sample actually taken does not resemble
a simple random sample, and this $s^2$ may be a poor estimate. A general
procedure is given.

The true variance of the mean of a simple random sample is

$$V_{ran} = \frac{N-n}{nN} S^2 = \frac{N-n}{nN(N-1)} \left[ \sum_h (N_h - 1) S_h^2 + \sum_h N_h (\bar{Y}_h - \bar{Y})^2 \right] \tag{5A.25}$$

by an algebraic identity for $S^2$.
In the first term inside the bracket, we need only put $s_h^2$ for $S_h^2$. The
second term requires investigation. For estimating $\sum N_h(\bar{Y}_h - \bar{Y})^2$ it is
natural to try $\sum N_h(\bar{y}_h - \bar{y}_{st})^2$. This quantity turns out to be an overestimate
that needs adjustment. The relevant result is stated as a theorem, since it
will be useful later.


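Theorem 5A.1. In stratified random sampling,

$$E\left[\sum_h N_h(\bar{y}_h - \bar{y}_{st})^2\right] = \sum_h N_h(\bar{Y}_h - \bar{Y})^2 + \sum_h \frac{(N_h - n_h)(1 - W_h) S_h^2}{n_h}$$
$$= \sum_h N_h(\bar{Y}_h - \bar{Y})^2 + \sum_h \frac{N_h S_h^2}{n_h}(1 - f_h)(1 - W_h)$$

Proof.
We may write

$$\sum_h N_h(\bar{y}_h - \bar{y}_{st})^2 = \sum_h N_h[(\bar{y}_h - \bar{Y}_h) + (\bar{Y}_h - \bar{Y}) - (\bar{y}_{st} - \bar{Y})]^2$$

We now expand and take the average over all possible samples. It may be
verified that the average of each of the two cross-product terms involving
$(\bar{Y}_h - \bar{Y})$ vanishes. This gives

$$E \sum_h N_h(\bar{y}_h - \bar{y}_{st})^2 = \sum_h N_h(\bar{Y}_h - \bar{Y})^2 + E \sum_h N_h(\bar{y}_h - \bar{Y}_h)^2$$
$$+ E \sum_h N_h(\bar{y}_{st} - \bar{Y})^2 - 2E \sum_h N_h(\bar{y}_h - \bar{Y}_h)(\bar{y}_{st} - \bar{Y}) \tag{5A.26}$$

But

$$\sum_h N_h(\bar{y}_h - \bar{Y}_h) = N(\bar{y}_{st} - \bar{Y})$$

by the definitions of $\bar{y}_{st}$ and $\bar{Y}$. Thus the last two terms in (5A.26) coalesce
to give

$$-E\,N(\bar{y}_{st} - \bar{Y})^2 = -\frac{1}{N} \sum_h \frac{N_h(N_h - n_h) S_h^2}{n_h}$$

since this expression is N times the variance of $\bar{y}_{st}$. For the second term on
the right in (5A.26),

$$E \sum_h N_h(\bar{y}_h - \bar{Y}_h)^2 = \sum_h N_h \frac{(N_h - n_h)}{N_h} \frac{S_h^2}{n_h} = \sum_h (N_h - n_h) \frac{S_h^2}{n_h}$$

because within each stratum $\bar{y}_h$ is the mean of a simple random sample.
Hence

$$E \sum_h N_h(\bar{y}_h - \bar{y}_{st})^2 = \sum_h N_h(\bar{Y}_h - \bar{Y})^2 + \sum_h (N_h - n_h) \frac{S_h^2}{n_h} - \frac{1}{N} \sum_h N_h(N_h - n_h) \frac{S_h^2}{n_h}$$
$$= \sum_h N_h(\bar{Y}_h - \bar{Y})^2 + \sum_h (N_h - n_h) \frac{S_h^2}{n_h} \left(1 - \frac{N_h}{N}\right)$$
$$= \sum_h N_h(\bar{Y}_h - \bar{Y})^2 + \sum_h \frac{N_h S_h^2}{n_h}(1 - f_h)(1 - W_h)$$

Corollary. An unbiased estimate of $\sum N_h(\bar{Y}_h - \bar{Y})^2$ is

$$\sum_h N_h(\bar{y}_h - \bar{y}_{st})^2 - \sum_h \frac{N_h s_h^2}{n_h}(1 - f_h)(1 - W_h)$$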
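The identity in Theorem 5A.1 can be checked numerically by enumerating every possible stratified sample from a very small population. The Python sketch below does this for a hypothetical two-stratum population (the values, the sample sizes, and all variable names are assumptions chosen for illustration); it compares the average of $\sum N_h(\bar{y}_h - \bar{y}_{st})^2$ over all samples with $\sum N_h(\bar{Y}_h - \bar{Y})^2 + \sum (N_h S_h^2/n_h)(1 - f_h)(1 - W_h)$.

```python
from itertools import combinations, product
from statistics import mean, variance

# Hypothetical population: two strata, sampled with n_h = 2 units each.
strata = [[1, 2, 4, 7], [3, 8, 9]]
n_h = [2, 2]

N_h = [len(s) for s in strata]
N = sum(N_h)
W = [Nh / N for Nh in N_h]
Ybar_h = [mean(s) for s in strata]
Ybar = sum(w * m for w, m in zip(W, Ybar_h))
S2 = [variance(s) for s in strata]  # divisor N_h - 1, as in the text


def between_strata_ss(sample_means):
    """Sum of N_h (ybar_h - ybar_st)^2 for one stratified sample."""
    y_st = sum(w * m for w, m in zip(W, sample_means))
    return sum(Nh * (m - y_st) ** 2 for Nh, m in zip(N_h, sample_means))


# Average over every possible stratified sample (all equally likely).
per_stratum = [[mean(c) for c in combinations(s, k)] for s, k in zip(strata, n_h)]
lhs = mean(between_strata_ss(means) for means in product(*per_stratum))

rhs = sum(Nh * (m - Ybar) ** 2 for Nh, m in zip(N_h, Ybar_h)) + sum(
    Nh * s2 / k * (1 - k / Nh) * (1 - w)
    for Nh, s2, k, w in zip(N_h, S2, n_h, W)
)
```

The two quantities `lhs` and `rhs` agree up to floating-point error, which also shows why the naive estimate overshoots: the second sum on the right is always non-negative.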


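When this expression is substituted in (5A.25), we find that an unbiased
estimate of $V_{ran}$ is

$$v_{ran} = \frac{N-n}{n(N-1)} \left[ \sum W_h s_h^2 - \sum \frac{W_h s_h^2}{n_h} + \sum \frac{W_h^2 s_h^2}{n_h} - \frac{1}{N} \sum W_h s_h^2 + \sum W_h \bar{y}_h^2 - \left(\sum W_h \bar{y}_h\right)^2 \right]$$

This expression is unattractive to compute. In nearly all applications
simplifications can be utilized. Two are given.

N > 50. This will hold for almost all populations. The fourth term
inside the bracket may be omitted, since it equals the first term divided by
N. We obtain

$$v_{ran} = \frac{N-n}{nN} \left[ \sum W_h s_h^2 - \sum \frac{W_h s_h^2}{n_h} + \sum \frac{W_h^2 s_h^2}{n_h} + \sum W_h \bar{y}_h^2 - \left(\sum W_h \bar{y}_h\right)^2 \right] \tag{5A.27}$$

All $n_h$ > 50. The second and third terms inside the bracket may be
dropped to give

$$v_{ran} = \frac{N-n}{nN} \left[ \sum W_h s_h^2 + \sum W_h \bar{y}_h^2 - \left(\sum W_h \bar{y}_h\right)^2 \right] \tag{5A.28}$$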
Example. The calculations are illustrated from the first three strata in the
sample of teachers' colleges (section 5.9). Data from the 1946 sample appear in
Table 5A.14. The means represent enrollment per college, in thousands.

TABLE 5A.14
BASIC DATA FROM A STRATIFIED SAMPLE

Stratum    $N_h$    $n_h$    $\bar{y}_h$    $s_h^2$
1          13       9        2.200          1.615
2          18       7        1.638          0.063
3          26       10       0.992          0.077
Totals     57       26

The sample is so small that expression (5A.27) for $v_{ran}$ will be used. The
supplementary calculations are given in Table 5A.15.


TABLE 5A.15
ARRANGEMENT OF CALCULATIONS

Stratum    $W_h$    $W_h s_h^2$    $W_h s_h^2/n_h$    $W_h^2 s_h^2/n_h$    $W_h \bar{y}_h$
1          0.228    0.36822        0.04091            0.00933              0.50160
2          0.316    0.01991        0.00284            0.00090              0.51761
3          0.456    0.03511        0.00351            0.00160              0.45235
Totals     1.000    0.42324        0.04726            0.01183              1.47156


The formulas work out as follows:

$$v_{st} = \sum \frac{W_h^2 s_h^2}{n_h} - \sum \frac{W_h s_h^2}{N} = 0.01183 - 0.00743 = 0.0044$$

$$v_{ran} = \frac{31}{(57)(26)} [0.4232 - 0.0473 + 0.0118 + 2.4000 - 2.1655] = 0.0130$$

Stratification appears to have reduced the variance to about one-third of the
value for a simple random sample.
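The arithmetic of this example can be retraced with a short Python sketch. It uses the data of Table 5A.14 and the five-term bracket of the worked example, $\sum W_h s_h^2 - \sum W_h s_h^2/n_h + \sum W_h^2 s_h^2/n_h + \sum W_h \bar{y}_h^2 - (\sum W_h \bar{y}_h)^2$. Two points are assumptions on my part rather than statements of the text: the multiplier is taken as $(N-n)/(nN)$, which reproduces the printed 0.0130, and the weights are left unrounded, so the later decimals differ slightly from Table 5A.15.

```python
# Table 5A.14: N_h, n_h, sample mean, sample variance s_h^2 for three strata.
N_h = [13, 18, 26]
n_h = [9, 7, 10]
ybar_h = [2.200, 1.638, 0.992]
s2_h = [1.615, 0.063, 0.077]

N, n = sum(N_h), sum(n_h)
W = [Nh / N for Nh in N_h]  # unrounded weights; the text rounds to 3 decimals

# Estimated variance of the stratified mean.
v_st = sum(w**2 * s / k for w, s, k in zip(W, s2_h, n_h)) - sum(
    w * s for w, s in zip(W, s2_h)
) / N

# Bracket of the worked example (the term of order 1/N is omitted).
bracket = (
    sum(w * s for w, s in zip(W, s2_h))
    - sum(w * s / k for w, s, k in zip(W, s2_h, n_h))
    + sum(w**2 * s / k for w, s, k in zip(W, s2_h, n_h))
    + sum(w * y**2 for w, y in zip(W, ybar_h))
    - sum(w * y for w, y in zip(W, ybar_h)) ** 2
)
v_ran = (N - n) / (n * N) * bracket  # multiplier assumed; see the note above

relative_variance = v_st / v_ran  # roughly one-third
```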


Proportional Allocation. An estimate $v_{ran}$ that is usually adequate is
obtained from the sum of squares of deviations of the sample values from
their mean, for

$$s^2 = \frac{\sum (y_{hi} - \bar{y})^2}{n - 1} = \frac{1}{n - 1} \left[ \sum (n_h - 1) s_h^2 + \sum n_h \bar{y}_h^2 - \frac{(\sum n_h \bar{y}_h)^2}{n} \right]$$

by the usual identity in the analysis of variance. If terms in $1/n_h$ are
negligible, this is equivalent to

$$\sum W_h s_h^2 + \sum W_h \bar{y}_h^2 - \left(\sum W_h \bar{y}_h\right)^2$$

since $W_h = n_h/n$. This in turn equals the quantity inside the bracket in
5A.28. Thus the expression

$$v_{ran} = \frac{(N - n)}{N} \frac{s^2}{n}$$

is satisfactory if allocation is proportional and terms in $1/n_h$ are negligible.


5A.11 ESTIMATION OF VARIANCE WITH ONE UNIT 
PER STRATUM 


If the population is highly variable and many effective criteria for strati- 
fication are known, stratification may be carried to the point at which the 
sample contains only one unit in each stratum. In this event the formula 
previously given for estimating $V(\bar{y}_{st})$ cannot be used. An estimate may be
attempted by grouping the strata in pairs. In the two strata that form a
pair we shall assume that the sizes $N_h$ are equal. Slight deviations from
equality do not vitiate the method. The population means $\bar{Y}_h$ for the two
members of a pair should not differ greatly, but the allocation into pairs 
should be made before seeing the sample results, for reasons that will 
become evident. The number of strata should be at least 20, to allow a 
minimum of 10 df in the estimated variance. 

Let the observations in a typical pair be $y_{j1}$, $y_{j2}$, where j goes from 1 to
L/2. Then, averaging over all samples from this pair,

$$E(y_{j1} - y_{j2})^2 = (\bar{Y}_{j1} - \bar{Y}_{j2})^2 + \frac{N_j - 1}{N_j}(S_{j1}^2 + S_{j2}^2) \tag{5A.29}$$

where $N_j = N_h$ is the size of each stratum in the pair. Consider the
estimate

$$v(\bar{y}_{st}) = \frac{1}{N^2} \sum_{j=1}^{L/2} N_j^2 (y_{j1} - y_{j2})^2 \tag{5A.30}$$

By (5A.29) the expected value of this quantity is

$$E\,v(\bar{y}_{st}) = \frac{1}{N^2} \left[ \sum_{h=1}^{L} N_h(N_h - 1) S_h^2 + \sum_{j=1}^{L/2} N_j^2 (\bar{Y}_{j1} - \bar{Y}_{j2})^2 \right] \tag{5A.31}$$

The first term on the right is the correct variance (by theorem 5.3 with
$n_h$ = 1): the second represents a positive bias. The size of the bias depends
on the success attained in the formation of pairs of strata whose true means
differ little. The form of estimate (5A.30) warns us not to construct pairs
by making the sample values differ as little as possible, since this gives a
serious underestimate. The technique is sometimes called the method of
"collapsed strata."
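The positive bias of the collapsed-strata estimate can be seen by brute force on a toy population. The Python sketch below takes one pair of strata of equal size, draws one unit from each in every possible way, and compares the true variance of $\bar{y}_{st}$ with the average of the estimate $\frac{1}{N^2}\sum N_j^2 (y_{j1} - y_{j2})^2$. The population values and the helper name `collapsed_strata_variance` are hypothetical.

```python
from itertools import product


def collapsed_strata_variance(pairs, N_j, N):
    """v(ybar_st) = (1/N^2) * sum_j N_j^2 (y_j1 - y_j2)^2 for paired strata."""
    return sum(Nj**2 * (y1 - y2) ** 2 for (y1, y2), Nj in zip(pairs, N_j)) / N**2


# Hypothetical population: one pair of strata, each of size 3.
stratum_1 = [1, 2, 4]
stratum_2 = [6, 7, 10]
N_pair = 3
N = 6

samples = list(product(stratum_1, stratum_2))  # one unit per stratum
estimates = [(y1 + y2) / 2 for y1, y2 in samples]  # ybar_st, equal weights
center = sum(estimates) / len(estimates)
true_variance = sum((e - center) ** 2 for e in estimates) / len(estimates)

mean_collapsed = sum(
    collapsed_strata_variance([s], [N_pair], N) for s in samples
) / len(samples)
```

Here the true variance is 10/9 while the collapsed-strata estimate averages 74/9; the excess equals $N_j^2(\bar{Y}_{j1} - \bar{Y}_{j2})^2/N^2$, which is large because the two stratum means in this made-up pair differ greatly.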
In an alternative method of sampling each pair is used as a single
stratum, with L/2 strata and two units chosen at random per stratum. An
unbiased estimate of $V(\bar{y}_{st})$ for this kind of sampling is obtainable from the
usual formula. The reader may verify that

$$V(\bar{y}_{st}) = \frac{1}{N^2} \left[ \sum_{h=1}^{L} N_h(N_h - 1) \frac{2N_h - 2}{2N_h - 1} S_h^2 + \sum_{j=1}^{L/2} \frac{N_j^2 (N_j - 1)}{2N_j - 1} (\bar{Y}_{j1} - \bar{Y}_{j2})^2 \right] \tag{5A.32}$$

By comparison with (5A.31), it appears that formula (5A.30) overestimates
not only the true variance with one unit per stratum but also the
variance that would apply if strata twice as large were used.
Whether the smaller strata are preferable, in the light of this result, is
debatable. Unfortunately, if there is a large gain in precision from one
unit per stratum, compared with two units per pair, there is also a large
overestimation of the variance.


5A.12 SHORT-CUTS IN THE COMPUTATION OF
STANDARD ERRORS 


One of the merits of probability sampling is that a standard error can be 
computed for any estimate made from the sample. Unfortunately, the 
computation of standard errors is more laborious than the computation of 
the estimates themselves. In complex national surveys containing hun- 
dreds or thousands of items, calculation and presentation of standard 
errors is a major problem. A number of devices have been proposed to cut 
down the labor and expense. The extent to which any device is helpful 


will of course depend on the computing machines and methods that are 
being used. 


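As a reminder, the formulas for the estimated variances of the estimated
population mean and total are as follows:

$$v(\bar{y}_{st}) = \sum_{h=1}^{L} \frac{W_h^2 s_h^2}{n_h}(1 - f_h) \tag{5A.33}$$

$$v(\hat{Y}_{st}) = \sum_{h=1}^{L} \frac{N_h^2 s_h^2}{n_h}(1 - f_h) \tag{5A.34}$$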
Consider first a survey 
and the sample is rela 
markedly from stratu 
of $S_h^2$ if there is some systematic pattern in the order in which units are
listed in the record book (see section 8.3). 

If the units are listed in an essentially random order, a second possibility 
is to record subtotals as the units are being added to compute the sample 
total in the stratum. For instance, if $n_h$ = 123, we might subtotal after
every 10 units, giving 12 subtotals $T_1, T_2, \ldots, T_{12}$. The last three units
are ignored in the subtotals, though they are included in the sample total.
Then $\sum (T_j - \bar{T})^2/110$ provides an estimate of $S_h^2$, with 11 d.f. If the
distribution of y is non-normal, this estimate is likely to be more precise
than an estimate computed from 12 random units because the use of
subtotals diminishes the effect of kurtosis (section 2.14) on $V(s^2)$.
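A minimal Python sketch of the subtotal device follows. It assumes, as the paragraph does, that the units are listed in an essentially random order, and it ignores the fpc; the function name `variance_from_subtotals` and the simulated listing are hypothetical. With 123 units and subtotals of 10, the divisor is 11 × 10 = 110.

```python
import random


def variance_from_subtotals(values, group_size):
    """Estimate S_h^2 from subtotals of consecutive groups of units.

    Units left over after the last complete group are ignored here,
    although they still count toward the stratum total.
    Returns (estimate, degrees_of_freedom).
    """
    k = len(values) // group_size  # number of complete subtotals
    if k < 2:
        raise ValueError("need at least two complete subtotals")
    subtotals = [
        sum(values[g * group_size:(g + 1) * group_size]) for g in range(k)
    ]
    t_bar = sum(subtotals) / k
    ss = sum((t - t_bar) ** 2 for t in subtotals)
    # Each subtotal has variance group_size * S^2 (fpc ignored), hence the divisor.
    return ss / ((k - 1) * group_size), k - 1


# Hypothetical listing of 123 units in random order.
rng = random.Random(1)
values = [rng.gauss(50, 5) for _ in range(123)]
estimate, df = variance_from_subtotals(values, 10)  # 12 subtotals, 11 d.f.
```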

Suppose now that the strata are numerous and the sample sizes are small
within strata, as, for instance, in geographic stratification covering a wide
area. For a large group of strata it may be reasonable to assume that $S_h^2$
varies little from stratum to stratum. We might then draw a random
subsample (e.g., of eight strata out of 40 to 50), compute $s_h^2$ in each of the
eight strata, and form a pooled $s^2$ to be used for all the 40 to 50 strata.
This method is riskier than that of estimating $S_h^2$ in every stratum by means
of a random subsample of two or three units and is preferable only if it is
time saving.

Other methods are possible when $n_h$ is constant in all strata or in a large
group of strata. Suppose that $n_h$ = 2 and $N_h$ is constant. Let $y_{h1}$, $y_{h2}$ be
the measurements on the two units, and let $d_h = y_{h1} - y_{h2}$. Then

$$E(d_h^2) = 2S_h^2$$

Hence an unbiased estimate of $V(\hat{Y}_{st})$ over this group of strata is

$$v(\hat{Y}_{st}) = \frac{N_h^2 (1 - f_h)}{4} \sum_{h=1}^{L} d_h^2$$

This result may be verified by comparison with (5A.34).
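The agreement between the difference short-cut, $\frac{N_h^2(1 - f_h)}{4}\sum d_h^2$, and the usual formula $\sum \frac{N_h^2 s_h^2}{n_h}(1 - f_h)$ can be confirmed on a small made-up sample. In the Python sketch below, the data and both function names are assumptions for illustration; $n_h$ = 2 and $N_h$ is the same in every stratum.

```python
def variance_of_total_from_pairs(pairs, N_h):
    """v(Y_st) = N_h^2 (1 - f_h) / 4 * sum d_h^2, for n_h = 2 and constant N_h."""
    f_h = 2 / N_h
    return N_h**2 * (1 - f_h) / 4 * sum((y1 - y2) ** 2 for y1, y2 in pairs)


def variance_of_total_standard(pairs, N_h):
    """Usual formula: sum N_h^2 s_h^2 / n_h (1 - f_h), with n_h = 2."""
    f_h = 2 / N_h
    total = 0.0
    for y1, y2 in pairs:
        m = (y1 + y2) / 2
        s2 = (y1 - m) ** 2 + (y2 - m) ** 2  # divisor n_h - 1 = 1
        total += N_h**2 * s2 / 2 * (1 - f_h)
    return total


# Hypothetical sample: three strata of 20 units, two units drawn from each.
pairs = [(3, 5), (10, 4), (7, 7)]
short_cut = variance_of_total_from_pairs(pairs, N_h=20)
standard = variance_of_total_standard(pairs, N_h=20)
```

Both routes give 3600 for these figures, since $s_h^2 = d_h^2/2$ when $n_h$ = 2.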
This estimate contains L df, more than is needed if L is large. The
strata can be grouped to form k groups, which may contain different
numbers of strata. Let $D_j = \sum d_h$, taken over the strata in the jth group. Then
it is easily shown that an unbiased estimate of $V(\hat{Y}_{st})$, with k df, is

$$v(\hat{Y}_{st}) = \frac{N_h^2 (1 - f_h)}{4} \sum_{j=1}^{k} D_j^2$$

Similar methods apply when $n_h$ is constant and greater than 2. Suppose
that $n_h$ = 4 and that 18 df are desired for the estimate $v(\hat{Y}_{st})$. Divide the
strata into six groups. In each stratum identify the four units in any
convenient way as the first, second, third, and fourth units, respectively. In
group j let $T_{j1}$, $T_{j2}$, $T_{j3}$, $T_{j4}$ be the totals over the first, second, third, and
fourth units, with mean $\bar{T}_j$. Then

$$v(\hat{Y}_{st}) = N_h^2 (1 - f_h) \sum_{j=1}^{6} \sum_{u=1}^{4} \frac{(T_{ju} - \bar{T}_j)^2}{12}$$

The general divisor in place of 12 is $n_h(n_h - 1)$.
If $n_h$ is constant but $N_h$ is not, the grouping method becomes more
laborious. Instead of the totals $T_{ju}$ we form totals

$$\hat{Y}_{ju} = \sum N_h y_{hu} \qquad (u = 1, 2, \ldots, n_h)$$

where $y_{hu}$ denotes the unit identified as the "uth" in the stratum and the
sum is over the jth group of strata. With k groups, it can be shown that

$$E\left[ \sum_{j=1}^{k} \sum_{u=1}^{n_h} \frac{(\hat{Y}_{ju} - \bar{\hat{Y}}_j)^2}{n_h(n_h - 1)} \right] = \sum_{h=1}^{L} \frac{N_h^2 S_h^2}{n_h}$$

the df being $k(n_h - 1)$. The quantity on the left is an overestimate of
$V(\hat{Y}_{st})$, since its expectation does not contain the correct fpc term $1 - f_h$.
If necessary, the estimate may be multiplied by 1 - n/N as a rough
correction.
If k = 1, the $n_h$ quantities $\hat{Y}_u$ are all unbiased estimates of the population
total Y and the method provides an estimate of $V(\hat{Y}_{st})$ based on $n_h - 1$ df.
In surveys containing many items the relation between $s(\hat{Y})$ and $\hat{Y}$ may
be substantially the same over a large class of items. This can be
investigated by plotting $s(\hat{Y})$ against $\hat{Y}$ for different items and seeking scales in
which a simple relation gives a good fit. An "error graph" of this kind, as
Yates (1960) has called it, is extremely helpful. If a good one exists, there is
less danger in computing s.e.'s from small numbers of df. For less
important items, the computation of the s.e. may be omitted and the graph relied
on to give an estimated s.e. Use of the graph or a table derived from it in
presenting s.e.'s may avoid the printing of hundreds of individual figures.
Examples of error graphs have been given by Yates (1960) and Hansen et
al. (1953).
In repetitive studies the relation between $s(\hat{Y})$ and $\hat{Y}$ may remain stable
or change only slowly over time. This provides an opportunity for
further savings. When an error graph has been constructed, computation
of s.e.'s in the future need be made only at certain time intervals, with
only the amount of detail required to detect an important change in the
shape or position of the error graph.
Zarkovic (1960) gives a review of short-cut computing methods and
Keyfitz (1957) presents ingenious methods usable when $n_h$ = 2.
5A.13 STRATA AS DOMAINS OF STUDY


This section deals with surveys in which the primary purpose is to make 
comparisons between different strata, assumed to be identifiable in advance. 
The rules for allocating the sample sizes to the strata are different from 
those that apply when the objective is to make over-all population
estimates. If there are only two strata, we might choose $n_1$, $n_2$ to minimize the
variance of the difference $(\bar{y}_1 - \bar{y}_2)$ between the estimated strata means.
Omitting the fpc's for reasons given in section 2.12, we have

$$V(\bar{y}_1 - \bar{y}_2) = \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}$$

With a linear cost function

$$C = c_0 + c_1 n_1 + c_2 n_2$$

V is minimized when

$$n_1 = \frac{n S_1/\sqrt{c_1}}{S_1/\sqrt{c_1} + S_2/\sqrt{c_2}}, \qquad n_2 = \frac{n S_2/\sqrt{c_2}}{S_1/\sqrt{c_1} + S_2/\sqrt{c_2}} \tag{5A.35}$$
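The allocation rule, with $n_h$ proportional to $S_h/\sqrt{c_h}$, is easy to try numerically. The Python sketch below fixes the amount available for field work (written `budget`, standing for $C - c_0$) rather than n, which gives the same proportions; the function names and the inputs are hypothetical, and the sample sizes are left unrounded.

```python
from math import sqrt


def allocate_for_comparison(S, c, budget):
    """Sample sizes n_h proportional to S_h / sqrt(c_h), spending `budget`.

    `budget` plays the role of C - c_0; the n_h are left unrounded.
    """
    scale = budget / sum(s * sqrt(ch) for s, ch in zip(S, c))
    return [scale * s / sqrt(ch) for s, ch in zip(S, c)]


def variance_of_difference(S, n):
    """V(ybar_1 - ybar_2) = S_1^2/n_1 + S_2^2/n_2, fpc omitted."""
    return sum(s**2 / k for s, k in zip(S, n))


# Hypothetical inputs: S_h = 10, 20 and unit costs c_h = 1, 4.
S, c, budget = [10.0, 20.0], [1.0, 4.0], 100.0
n_opt = allocate_for_comparison(S, c, budget)      # [20.0, 20.0]
v_opt = variance_of_difference(S, n_opt)           # 25.0
v_other = variance_of_difference(S, [24.0, 19.0])  # same cost, larger variance
```

The alternative allocation (24, 19) costs the same 100 units but gives a variance of about 25.2, slightly above the optimum of 25.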


With L strata, L > 2, the optimum allocation depends on the amounts of
precision desired for different comparisons. For instance, the cost might
be minimized subject to the set of L(L - 1)/2 conditions that
$V(\bar{y}_h - \bar{y}_i) \le V_{hi}$, where the values of $V_{hi}$ are chosen according to the
precision considered necessary for a satisfactory comparison of strata h and i.
Frequently a simpler method of allocation is adequate, especially if the
$S_h$ and $c_h$ do not differ greatly. One approach is to minimize the average
variance of the difference between all L(L - 1)/2 pairs of strata, that is, to
minimize

$$\bar{V} = \frac{2}{L} \left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} + \cdots + \frac{S_L^2}{n_L} \right)$$

$\bar{V}$ is minimized, for fixed C, by the rule in (5A.35), with $n_h$ proportional
to $S_h/\sqrt{c_h}$.

This rule may result in certain pairs of strata being more precisely
compared and others less precisely than is felt appropriate. An alternative is to
select the $n_h$ so that the s.e. of the difference is the same, say $\sqrt{V}$, for every
pair of strata. This amounts to making $S_h^2/n_h = V/2$ for every stratum.
For a fixed cost this method gives less over-all precision than the first
method. The reader may verify that the two optimum allocations give

$$\bar{V} = \frac{2\left(\sum S_h \sqrt{c_h}\right)^2}{L(C - c_0)}, \qquad V = \frac{2 \sum c_h S_h^2}{C - c_0}$$

It follows from the Cauchy-Schwarz inequality that V is always greater
than $\bar{V}$ unless $S_h \sqrt{c_h}$ = constant. If V is substantially greater than $\bar{V}$, a
compromise allocation can sometimes be found, after a little trial and error,
that will give an average variance close to $\bar{V}$ and also keep $V(\bar{y}_h - \bar{y}_i)$
reasonably constant.
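To see how far apart the two allocations can be, the Python sketch below evaluates $\bar{V} = 2(\sum S_h\sqrt{c_h})^2 / [L(C - c_0)]$ for the minimum-average-variance rule and $V = 2\sum c_h S_h^2/(C - c_0)$ for the equal-precision rule. The inputs and function names are hypothetical, and `budget` stands for $C - c_0$.

```python
from math import sqrt


def average_variance_optimum(S, c, budget):
    """V-bar under n_h proportional to S_h / sqrt(c_h)."""
    L = len(S)
    return 2 * sum(s * sqrt(ch) for s, ch in zip(S, c)) ** 2 / (L * budget)


def equal_precision_variance(S, c, budget):
    """Common V when S_h^2 / n_h = V / 2 in every stratum."""
    return 2 * sum(ch * s**2 for s, ch in zip(S, c)) / budget


# Hypothetical inputs: three strata, budget = C - c_0.
S, c, budget = [10.0, 20.0, 15.0], [1.0, 4.0, 2.0], 300.0
v_bar = average_variance_optimum(S, c, budget)
v_equal = equal_precision_variance(S, c, budget)
n_equal = [2 * s**2 / v_equal for s in S]  # sample sizes giving the common V
```

For these figures the equal-precision variance exceeds the minimum average variance, as the Cauchy-Schwarz argument requires, and the sample sizes `n_equal` exhaust the same budget.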

Sometimes the objective is to obtain estimates for each stratum as well
as over-all estimates for the whole population. In planning the survey, we
might specify the following conditions:

$$V(\bar{y}_h) = \frac{S_h^2}{n_h}(1 - f_h) \le V_h, \qquad V(\bar{y}_{st}) = \sum_h \frac{W_h^2 S_h^2}{n_h}(1 - f_h) \le V$$

The fpc terms are now included, since the purpose is to specify the
precision with which the means in the finite population are to be estimated.
The conditions on the $V(\bar{y}_h)$ determine lower limits to the values of the
$n_h$. If these lower limits are found to satisfy the condition on $V(\bar{y}_{st})$, the
allocation problem is solved. When the condition on $V(\bar{y}_{st})$ is not
satisfied, Dalenius (1957) has indicated a graphical approach.


5A.14 ESTIMATING TOTALS AND MEANS
OVER SUBPOPULATIONS 


Frequently the subpopulations or domains of study are represented in 
all strata. If stratification is geographic, for example, separate estimates 
may be wanted, over the whole population, for males and females, for 
different age groups, for users and nonusers of Blank’s toothpaste, etc. The 
problem presents some complications. The basic formulas were given by 
Yates (1953) with further discussion and proofs by Durbin (1958) and 
Hartley (1959). Methods applicable to a single stratum are discussed in 
sections 2.10 and 2.11. 


The following notation applies to the units in stratum h which lie in
domain j.

Notation.

Number of units: $N_{hj}$; $\sum_j N_{hj} = N_h$

Number in sample: $n_{hj}$; $\sum_j n_{hj} = n_h$

Measurement on individual unit: $y_{hij}$

Sample mean: $\bar{y}_{hj} = \sum_{i=1}^{n_{hj}} y_{hij} / n_{hj}$

Domain mean: $\bar{Y}_{hj} = \sum_{i=1}^{N_{hj}} y_{hij} / N_{hj}$

The population total and mean for domain j over all strata are, respectively,

$$Y_j = \sum_h N_{hj} \bar{Y}_{hj}, \qquad \bar{Y}_j = \frac{Y_j}{N_j}$$

where $N_j = \sum_h N_{hj}$.

The complication arises because the $n_{hj}$ are random variables. If the
$N_{hj}$ were known, the problem would be simple. As estimates of $Y_j$
and $\bar{Y}_j$, we could use

$$\hat{Y}_j = \sum_h N_{hj} \bar{y}_{hj}, \qquad \hat{\bar{Y}}_j = \frac{\hat{Y}_j}{N_j}$$

As shown in section 2.10, the ordinary formula for $V(\bar{y}_{hj})$ is still valid.
Thus

$$V(\hat{Y}_j) = \sum_h \frac{N_{hj}^2 S_{hj}^2}{n_{hj}} \left(1 - \frac{n_{hj}}{N_{hj}}\right)$$

where $S_{hj}^2$ is the variance among units in domain j within stratum h. In
applications, however, the $N_{hj}$ are rarely known.

Estimating Domain Totals

In default of the $N_{hj}$, each stratum total of the domain is estimated as
in section 2.11. These totals are added to obtain an estimated domain
total, that is,

$$\hat{Y}_j = \sum_h \frac{N_h}{n_h} \sum_{i=1}^{n_{hj}} y_{hij}$$

The true and estimated variance of $\hat{Y}_j$ are found by the device used in
section 2.11. A variate $y'_{hi}$ is introduced which equals $y_{hij}$ for all units in
domain j and equals zero for all other units in the population. As shown
in section 2.11, this gives for the estimated variance

$$v(\hat{Y}_j) = \sum_h \frac{N_h^2 (1 - f_h)}{n_h(n_h - 1)} \left[ \sum_{i=1}^{n_{hj}} y_{hij}^2 - \frac{\left(\sum_{i=1}^{n_{hj}} y_{hij}\right)^2}{n_h} \right]$$
which may be called the linear regression coefficient of y on x in the finite
population. The resulting minimum variance is

$$V_{min}(\bar{y}_{lr}) = \frac{1-f}{n} S_y^2 (1 - \rho^2)$$

where $\rho$ is the population correlation coefficient between y and x.
Proof. In expression (7.4), for $V(\bar{y}_{lr})$, put

$$b_0 = B + d = \frac{S_{yx}}{S_x^2} + d$$

This gives

$$V(\bar{y}_{lr}) = \frac{1-f}{n} \left[ S_y^2 - 2S_{yx}\left(\frac{S_{yx}}{S_x^2} + d\right) + S_x^2 \left(\frac{S_{yx}^2}{S_x^4} + \frac{2dS_{yx}}{S_x^2} + d^2\right) \right]$$
$$= \frac{1-f}{n} \left[ \left(S_y^2 - \frac{S_{yx}^2}{S_x^2}\right) + d^2 S_x^2 \right] \tag{7.6}$$

Clearly, this is minimized when d = 0. Since $\rho^2 = S_{yx}^2 / S_y^2 S_x^2$,

$$V_{min}(\bar{y}_{lr}) = \frac{1-f}{n} S_y^2 (1 - \rho^2) \tag{7.7}$$

The same analysis may be used to show how far $b_0$ can depart from B
without incurring a substantial loss of precision. From (7.6) and (7.7),

$$V(\bar{y}_{lr}) = \frac{1-f}{n} \left[ S_y^2 (1 - \rho^2) + (b_0 - B)^2 S_x^2 \right]$$
$$= V_{min}(\bar{y}_{lr}) \left[ 1 + \frac{(b_0 - B)^2 S_x^2}{S_y^2 (1 - \rho^2)} \right]$$

Since $B S_x = \rho S_y$, this may be written

$$V(\bar{y}_{lr}) = V_{min}(\bar{y}_{lr}) \left[ 1 + \left(\frac{b_0}{B} - 1\right)^2 \frac{\rho^2}{1 - \rho^2} \right]$$

Thus, if the proportional increase in variance is to be less than $\alpha$, we
must have

$$\left| \frac{b_0}{B} - 1 \right| < \sqrt{\alpha (1 - \rho^2)/\rho^2} \tag{7.8}$$

For example, if $\rho$ = 0.7, the increase in variance is less than 10% ($\alpha$ = 0.1),
provided that

$$\left| \frac{b_0}{B} - 1 \right| < \sqrt{(0.1)(0.51)/(0.49)} = 0.32$$

Expression (7.8) makes it clear that in order to ensure a small proportional
increase in variance $b_0/B$ must be close to 1 if $\rho$ is very high but can depart
substantially from 1 if $\rho$ is only moderate.
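The bound in (7.8), $\sqrt{\alpha(1 - \rho^2)/\rho^2}$, is simple to tabulate. The short Python sketch below evaluates it for the example in the text and for a higher correlation; the function name `max_relative_departure` and the second value of $\rho$ are assumptions made for illustration.

```python
from math import sqrt


def max_relative_departure(rho, alpha):
    """Largest |b0/B - 1| keeping the proportional variance increase below alpha."""
    return sqrt(alpha * (1 - rho**2) / rho**2)


bound = max_relative_departure(rho=0.7, alpha=0.1)   # about 0.32, as in the text
tight = max_relative_departure(rho=0.95, alpha=0.1)  # a high rho leaves less room
```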


7.3 REGRESSION ESTIMATES WHEN b IS COMPUTED 
FROM THE SAMPLE 


Theorem 7.2 suggests that if b must be computed from the sample an
effective estimate is likely to be the familiar least squares estimate of B,
that is,

$$b = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{7.9}$$

The theory of linear regression plays a prominent part in statistical
methodology. The standard results of this theory are not entirely suitable
for sample surveys because they require the assumptions that the
population regression of y on x is linear, that the residual variance of y about
the regression line is constant, and that the population is infinite. If the
first two assumptions are violently wrong, a linear regression estimate will
probably not be used. However, in surveys in which the regression of y on
x is thought to be approximately linear, it is helpful to be able to use $\bar{y}_{lr}$
without having to assume exact linearity or constant residual variance.

Consequently, we present an approach that does not demand that the
regression in the population be linear. The results hold only in large
samples. They are analogous to the large-sample theory for the ratio
estimate.

First we show that in samples of size n the quantity (b - B) is of order
$1/\sqrt{n}$. Define the variate $e_i$ by the relation

$$e_i = y_i - \bar{Y} - B(x_i - \bar{X}) \tag{7.10}$$

It follows that

$$\sum_{i=1}^{N} e_i (x_i - \bar{X}) = \sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X}) - B \sum_{i=1}^{N} (x_i - \bar{X})^2 = 0 \tag{7.11}$$
148 SAMPLING TECHNIQUES 


Estimating Domain Means 


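In order to estimate the domain mean $Y_j/N_j$, a sample estimate of $N_j$ is
required. An unbiased estimate is

$$\hat{N}_j = \sum_h \frac{N_h}{n_h} n_{hj}$$

Hence we take

$$\hat{\bar{Y}}_j = \frac{\sum_h \frac{N_h}{n_h} \sum_{i=1}^{n_{hj}} y_{hij}}{\sum_h \frac{N_h}{n_h} n_{hj}}$$

With proportional stratification, $\hat{\bar{Y}}_j$ reduces to the ordinary sample mean
of the units that fall in domain j. In the general case, this estimate is known
as a combined ratio estimate, discussed later in section 6.11. To show it,
introduce another dummy variate $x'_{hi}$ which equals 1 for every unit in
domain j and 0 for all other units, where i now goes from 1 to $N_h$. Clearly,

$$\bar{x}'_h = \frac{\sum_{i=1}^{n_h} x'_{hi}}{n_h} = \frac{n_{hj}}{n_h}, \qquad \bar{y}'_h = \frac{\sum_{i=1}^{n_h} y'_{hi}}{n_h} = \frac{n_{hj}}{n_h} \bar{y}_{hj} \tag{5A.36}$$

so that the estimated domain mean may be written

$$\hat{\bar{Y}}_j = \frac{\sum_h N_h \bar{y}'_h}{\sum_h N_h \bar{x}'_h}$$

This is the formula for the combined ratio estimate for the two variables
$y'_{hi}$ and $x'_{hi}$. From section 6.11, the estimated variance may be expressed
approximately as

$$v(\hat{\bar{Y}}_j) = \frac{1}{\hat{N}_j^2} \sum_h \frac{N_h^2 (1 - f_h)}{n_h(n_h - 1)} \sum_{i=1}^{n_h} \left[ (y'_{hi} - \bar{y}'_h) - \hat{\bar{Y}}_j (x'_{hi} - \bar{x}'_h) \right]^2 \tag{5A.37}$$

The second summation may be written

$$\sum_{i=1}^{n_h} \left[ (y'_{hi} - \hat{\bar{Y}}_j x'_{hi}) - (\bar{y}'_h - \hat{\bar{Y}}_j \bar{x}'_h) \right]^2 = \sum_{i=1}^{n_{hj}} (y_{hij} - \hat{\bar{Y}}_j)^2 - \frac{n_{hj}^2}{n_h} (\bar{y}_{hj} - \hat{\bar{Y}}_j)^2 \tag{5A.38}$$

using (5A.36). Further, the first term in (5A.38) can be expressed
alternatively as

$$\sum_{i=1}^{n_{hj}} (y_{hij} - \bar{y}_{hj})^2 + n_{hj} (\bar{y}_{hj} - \hat{\bar{Y}}_j)^2$$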
Inserting these results in (5A.37) gives, finally, for the estimated variance,

$$v(\hat{\bar{Y}}_j) = \frac{1}{\hat{N}_j^2} \sum_h \frac{N_h^2 (1 - f_h)}{n_h(n_h - 1)} \left[ \sum_{i=1}^{n_{hj}} (y_{hij} - \bar{y}_{hj})^2 + n_{hj} \left(1 - \frac{n_{hj}}{n_h}\right) (\bar{y}_{hj} - \hat{\bar{Y}}_j)^2 \right] \tag{5A.39}$$

The term on the right represents a between-stratum contribution to the
variance. Differences among strata means are not entirely eliminated from
the variance of the estimated mean of any subpopulation. The between-stratum
contribution is small if the terms $1 - n_{hj}/n_h$ are small, that is, if
the subpopulation is almost as large as the complete population.
As Durbin (1958) has pointed out, (5A.39) applies also to means
estimated for the whole population, if the sample is incomplete for any reason
such as nonresponse, provided, of course, that $\hat{\bar{Y}}_j$ is the estimate used. In
this event $\hat{\bar{Y}}_j$ is interpreted as the estimated mean for the part of the
population that would give a response under the methods of data collection
employed. There is, however, an additional complication, in that the
"nonresponse" part of the population often has a different mean from the
"response" part. Thus $\hat{\bar{Y}}_j$ is a biased estimate of the mean of the whole
population, and this bias contribution is not included in (5A.39).
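The domain estimators of this section can be put together in a few lines of Python. The sketch below computes the estimated domain total $\sum_h (N_h/n_h) \sum_i y_{hij}$, the estimated domain size $\sum_h (N_h/n_h) n_{hj}$, their ratio as the estimated domain mean, and the estimated variance of the total obtained from the zero-filled variate. The function name `domain_estimates`, the data layout, and the sample figures are all hypothetical.

```python
def domain_estimates(strata):
    """Estimate a domain total, size, and mean from a stratified sample.

    `strata` is a list of (N_h, sample) pairs; each sample is a list of
    (y, in_domain) tuples for all n_h units drawn in the stratum.
    Returns (Y_hat, N_hat, mean_hat, v_Y_hat).
    """
    Y_hat = N_hat = v_Y_hat = 0.0
    for N_h, sample in strata:
        n_h = len(sample)
        y_dom = [y for y, in_domain in sample if in_domain]
        Y_hat += N_h / n_h * sum(y_dom)
        N_hat += N_h / n_h * len(y_dom)
        # Sum of squares of the zero-filled variate y' about its stratum mean.
        ss = sum(y**2 for y in y_dom) - sum(y_dom) ** 2 / n_h
        v_Y_hat += N_h**2 * (1 - n_h / N_h) / (n_h * (n_h - 1)) * ss
    return Y_hat, N_hat, Y_hat / N_hat, v_Y_hat


# Hypothetical sample from two strata.
strata = [
    (100, [(5, True), (7, True), (2, False), (9, False)]),
    (200, [(10, True), (3, False), (4, False), (12, True), (8, True)]),
]
Y_hat, N_hat, mean_hat, v_Y_hat = domain_estimates(strata)
```

Note that the divisor inside the sum of squares is $n_h$, not $n_{hj}$: units outside the domain enter as zeros, which is what lets the random $n_{hj}$ contribute to the variance.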


EXERCISES 


5A.1 In planning a survey of sales in a certain type of store, with n = 550,
good estimates of $S_h$ are available from a previous survey in two of the three
strata. The third stratum consists of new stores and stores that had no sales in
the previous survey, so that a value for $S_3$ has to be guessed. If $S_3$ is actually 10,
compute $V(\bar{y}_{st})$ as given by an estimated Neyman allocation when $S_3$ is guessed
as (a) 5, (b) 20. Show that in both cases the proportional increase in variance
over the true optimum is slightly over 2%.

                      True     Estimated $S_h$
Stratum    $W_h$      $S_h$    (a)      (b)
1          0.3        30       30       30
2          0.6        20       20       20
3          0.1        10       5        20


5A.2 Show that if all $S_h$, except $S_L$, are correctly estimated and $S_L$ is
estimated as $\hat{S}_L = S_L(1 + \lambda)$, the proportional increase in $V_{min}(\bar{y}_{st})$, using $\hat{S}_L$
instead of the true $S_L$, for Neyman allocation, is

$$\frac{\lambda^2 n_L'(n - n_L')}{(1 + \lambda) n^2}$$

where $n_L'$ is the sample size in stratum L under true Neyman allocation. Verify
that this formula agrees with the results in exercise 5A.1. (The agreement is not
by definition of B. Now

$$b = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} [\bar{Y} + B(x_i - \bar{X}) + e_i](x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad \text{using (7.10)}$$
$$= B + \frac{\sum_{i=1}^{n} e_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{7.12}$$

By theorem 2.3, $\sum_{i=1}^{n} e_i (x_i - \bar{x})/(n - 1)$ is an unbiased estimate of
$\sum_{i=1}^{N} e_i (x_i - \bar{X})/(N - 1)$, which by (7.11) is zero. In repeated samples of size n
the sample covariance $\sum e_i (x_i - \bar{x})/(n - 1)$ is therefore distributed about
zero mean. The standard error of a sample covariance is known to be of
order $1/\sqrt{n}$. Thus, in samples of size n, $\sum e_i (x_i - \bar{x})/(n - 1)$ will be of
order $1/\sqrt{n}$. But the quantity $\sum (x_i - \bar{x})^2/(n - 1) = s_x^2$ is of order unity
in samples of size n. Hence, from (7.12), (b - B) is of order $1/\sqrt{n}$.


Theorem 7.3. If b is the least squares estimate of B and

$$\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}) \tag{7.13}$$

then in simple random samples of size n

$$V(\bar{y}_{lr}) \doteq \frac{1-f}{n} S_y^2 (1 - \rho^2)$$

provided that n is large enough so that terms of order $1/\sqrt{n}$ are negligible.
Proof. By averaging (7.10) over the units in the sample, we have

$$\bar{e} = \bar{y} - \bar{Y} - B(\bar{x} - \bar{X})$$

Substitution for $\bar{y}$ into expression (7.13) for $\bar{y}_{lr}$ gives

$$\bar{y}_{lr} = \bar{Y} + (b - B)(\bar{X} - \bar{x}) + \bar{e}$$

From (7.10) it is clear that the population mean of the $e_i$ is zero. Hence $\bar{e}$
is of order $1/\sqrt{n}$. But we have shown that (b - B) is of order $1/\sqrt{n}$.
Since $(\bar{X} - \bar{x})$ is also of order $1/\sqrt{n}$, their product $(b - B)(\bar{X} - \bar{x})$ is of
order 1/n. Consequently this product can be ignored relative to $\bar{e}$ if terms
of order $1/\sqrt{n}$ are negligible. This gives

$$\bar{y}_{lr} - \bar{Y} \doteq \bar{e}$$

Since $E(\bar{e})$ is zero by (7.10) and theorem 2.1, $E(\bar{e}^2)$ is the variance of the
mean of the quantities $e_i$ in a simple random sample. Hence, by theorem
2.2,

$$V(\bar{y}_{lr}) \doteq \frac{1-f}{n} S_e^2 = \frac{1-f}{n} \cdot \frac{\sum_{i=1}^{N} e_i^2}{N - 1} \tag{7.16}$$
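Now

$$\sum_{i=1}^{N} e_i^2 = \sum [(y_i - \bar{Y}) - B(x_i - \bar{X})]^2$$
$$= \sum (y_i - \bar{Y})^2 - 2B \sum (y_i - \bar{Y})(x_i - \bar{X}) + B^2 \sum (x_i - \bar{X})^2$$
$$= \sum (y_i - \bar{Y})^2 - B^2 \sum (x_i - \bar{X})^2$$

by the definition of B, equation 7.5. But

$$B^2 \sum (x_i - \bar{X})^2 = \frac{\left[\sum (y_i - \bar{Y})(x_i - \bar{X})\right]^2}{\sum (x_i - \bar{X})^2} = \rho^2 \sum (y_i - \bar{Y})^2$$

Hence

$$\sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - \bar{Y})^2 (1 - \rho^2)$$

so that, finally,

$$V(\bar{y}_{lr}) \doteq \frac{1-f}{n} S_y^2 (1 - \rho^2) \tag{7.17}$$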


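As a sample estimate of $V(\bar{y}_{lr})$, valid in large samples, we may use

$$v(\bar{y}_{lr}) = \frac{1-f}{n(n-2)} \sum_{i=1}^{n} [(y_i - \bar{y}) - b(x_i - \bar{x})]^2 \tag{7.18}$$
$$= \frac{1-f}{n(n-2)} \left\{ \sum (y_i - \bar{y})^2 - \frac{\left[\sum (y_i - \bar{y})(x_i - \bar{x})\right]^2}{\sum (x_i - \bar{x})^2} \right\} \tag{7.19}$$

the latter being the usual short-cut computing formula. The derivation is
as follows.

In theorem 7.3, equation (7.16), we had

$$V(\bar{y}_{lr}) \doteq \frac{1-f}{n} S_e^2$$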
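As a small numerical illustration of the regression estimate with b computed from the sample, the Python sketch below evaluates the least squares b, the estimate $\bar{y} + b(\bar{X} - \bar{x})$, and the large-sample variance estimate in both its residual form and its short-cut form, using the divisor n(n - 2). The data, the population size, and the known mean $\bar{X}$ are hypothetical, and the sample is far too small for the large-sample theory to apply; it serves only to check the algebra.

```python
def regression_estimate(x, y, X_bar, N):
    """Linear regression estimate of the population mean with least squares b.

    Returns (b, y_lr, v18, v19), where v18 and v19 are the residual and
    short-cut forms of the large-sample variance estimate.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b = sxy / sxx
    y_lr = y_bar + b * (X_bar - x_bar)
    factor = (1 - n / N) / (n * (n - 2))
    v18 = factor * sum(
        ((yi - y_bar) - b * (xi - x_bar)) ** 2 for xi, yi in zip(x, y)
    )
    v19 = factor * (syy - sxy**2 / sxx)
    return b, y_lr, v18, v19


# Hypothetical sample of n = 5 from N = 50 units, with known X_bar = 3.2.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b, y_lr, v18, v19 = regression_estimate(x, y, X_bar=3.2, N=50)
```

The two variance forms agree to rounding error, which is the sense in which the second is a computing short-cut for the first.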
exact because of the rounding of the $n_h$ to integers.) Hence show that a 50%
underestimation of $S_L$ has the same effect as a 100% overestimation.

5A.3 If there are two strata and if $\phi$ is the ratio of the actual $n_1/n_2$ to the
Neyman optimum $n_1/n_2$, show that whatever the values of $N_1$, $N_2$, $S_1$, and $S_2$, the
ratio $V_{min}(\bar{y}_{st})/V(\bar{y}_{st})$ is never less than $4\phi/(1 + \phi)^2$.

5A.4 The results of a simple random sample with n = 1000 can be classified
into three "strata," with $\bar{y}_h$ = 10.2, 12.6, and 17.1, $s_h^2$ = 10.82 (the same in each
stratum), and $s^2$ = 17.66. The estimated stratum weights are $W_h$ = 0.5, 0.3, 0.2,
respectively. These weights are known to be inexact, but it is thought that all
are correct within 5%, so that the worst cases are either $W_h$ = 0.525, 0.285, and
0.190 or $W_h$ = 0.475, 0.315, and 0.210. By the methods of section 5A.2, would
you recommend stratification? (Where needed, assume that $\bar{y}_h = \bar{Y}_h$ and $s_h^2 = S_h^2$.)

5A.5 A survey with three strata is planned to estimate the percentage of 
families who have accounts in savings banks and the average amount invested 


per family. Advance estimates of the percentages P_h and the within-stratum S_h
of the amount invested are as follows.

Stratum    W_h    P_h (%)    S_h ($)
1          0.6    20         90
2          0.3    40         180
3          0.1    70         520

Compute the smallest sample sizes n and the n_h that satisfy the following
requirements: (a) The percentage of families is to be estimated with s.e. = 2 and the
average amount invested with s.e. = $5, (b) The percentage of families is to be
estimated with s.e. = 1.5 and the average amount invested with s.e. = $5.


5A.6 The table at top of p. 151 shows the frequency distribution of a popula- 
tion of 911 city sizes for cities from 10,000 to 60,000, arranged in classes of 2000. 
To shorten the calculations, a coded y' and values of $\sqrt{f}$, cum $\sqrt{f}$, cum f, fy', and
$\sum f y'^2$ are given. Apply the Dalenius-Hodges rule to create two strata for
optimum allocation in the sense of Neyman. Find the values of $W_h$ and $S_h$
for each of your strata. Verify (a) that the optimum sample sizes are almost the
same in the two strata and (b) by finding $S^2$ for the whole population, that


VQ) 


Volid ^ 


5A.7 The right triangular distribution f(y) = 2(1 - y), 0 < y < 1, is
divided into two strata at the point a. (a) Show that

$$W_1 = a(2 - a), \qquad W_2 = (1 - a)^2$$

$$S_1^2 = \frac{a^2 (6 - 6a + a^2)}{18 (2 - a)^2}, \qquad S_2^2 = \frac{(1 - a)^2}{18}$$

(b) Show that under the cum $\sqrt{f}$ rule the best choice of a is $1 - 1/\sqrt[3]{4} = 0.37$
and that with this boundary the optimum $n_1/n$ is about 0.52 and $V(\bar{y}_{st})$ is about
27% of the value given by simple random sampling.
          f      y'    $\sqrt{f}$   Cum f   Cum $\sqrt{f}$   fy'

10-       205    0     14.3    205     14.3     0
12-       135    1     11.6    340     25.9     135
14-       106    2     10.3    446     36.2     212
16-       82     3     9.1     528     45.3     246
18-       61     4     7.8     589     53.1     244
20-       42     5     6.5     631     59.6     210
22-       32     6     5.7     663     65.3     192
24-       30     7     5.5     693     70.8     210
26-       27     8     5.2     720     76.0     216
28-       18     9     4.2     738     80.2     162
30-       22     10    4.7     760     84.9     220
32-       21     11    4.6     781     89.5     231
34-       19     12    4.4     800     93.9     228
36-       16     13    4.0     816     97.9     208
38-       14     14    3.7     830     101.6    196
40-       17     15    4.1     847     105.7    255
42-       9      16    3.0     856     108.7    144
44-       8      17    2.8     864     111.5    136
46-       11     18    3.3     875     114.8    198
48-       9      19    3.0     884     117.8    171
50-       7      20    2.6     891     120.4    140
52-       4      21    2.0     895     122.4    84
54-       5      22    2.2     900     124.6    110
56-       5      23    2.2     905     126.8    115
58-       6      24    2.4     911     129.2    144
Totals    911          129.2                    4407

$\sum f y'^2$ = 50,395


5A.8 A sum of $5000 is available for a stratified sample. In the notation of
section 5A.7 the cost function is thought to be, roughly, C = 200L + 10n and

$$V(\bar{y}_{st}) = \frac{S_y^2}{n}\left[\frac{\rho^2}{L^2} + (1 - \rho^2)\right]$$

where ρ is the correlation between the variate used to construct the strata and the
variate to be measured in the survey. Compute the optimum L for ρ = 0.95,
0.9, and 0.8. What is a good compromise number of strata to use for all three
values of ρ?

5A.9 The following data are derived from a stratified sample of tire dealers
taken in March 1945 (Deming and Simmons, 1946). The dealers were assigned
to strata according to the number of new tires held at a previous census. The
sample means $\bar{y}_h$ are the mean numbers of new tires per dealer. (a) Estimate the
gain in precision due to the stratification. (b) Compare this result with the gain
that would have been attained from proportional allocation.


Stratum
boundaries    N_h       W_h       ybar_h    s_h^2    n_h
1-9           19,850    0.8032     4.1       34.8    3000
10-19          3,250    0.1315    13.0       92.2     600
20-29          1,007    0.0407    25.0      174.2     340
30-39            606    0.0245    38.2      320.4     230
Totals        24,713    0.9999                       4170

5A.10, Fora population with N = 6, L i : 
the first stratum ids, 6,9 in the second stratum. Compute (a) V() for a simple 
random sample with n = 2, (b) V(9,,) for a stratified random sample with one 


unit per stratum, (c) the average value i(j,) as estimated by the method of 
collapsed strata. Verify that o(7,,) > V(g) 


= 2, the values of y,; are 0, 1, 3 in 
CHAPTER 6


Ratio Estimates 


6.1 METHODS OF ESTIMATION 


One feature of the growth of theoretical statistics is the emergence of a
large body of theory which discusses how to make good estimates from
data. In the development of theory specifically for sample surveys, little
use has been made of this knowledge. I think there are two principal
reasons. First, in routine surveys that contain a large number of items,
there is a great advantage in an estimation procedure that requires little
more than simple addition, whereas the superior methods of estimation of
statistical theory, such as maximum likelihood, may necessitate a series of
successive approximations before the estimate can be found. Second, there
has been a difference in attitude in the two lines of research. Most of the
estimation methods in theoretical statistics take it for granted that we
know the functional form of the frequency distribution followed by the
data in the sample, and the method of estimation is carefully geared to this
type of distribution. The preference in sample survey theory has been to
make only limited assumptions about this frequency distribution (that it is
very skew or rather symmetrical) and to leave its specific functional form
out of the discussion. This attitude is a reasonable one for handling surveys
in which the type of distribution may change from one item to another
and when we do not wish to stop and examine all of them before deciding
how to make each estimate.
Consequently, estimation techniques for sample survey work are at
present restricted in scope. Two techniques are considered: the ratio
method in this chapter and the linear regression method in Chapter 7.
The use of more complex methods may increase, at least in small, specialized
surveys, because the gain in precision from a superior method of estimation
can often be secured cheaply, since only the final computations are
affected.
6.2 THE RATIO ESTIMATE


In the ratio method an auxiliary variate $x_i$, correlated with $y_i$, is obtained
for each unit in the sample. The population total $X$ of the $x_i$ must be
known. In practice, $x_i$ is often the value of $y_i$ at some previous time when a
complete census was taken. The aim in this method is to obtain increased
precision by taking advantage of the correlation between $y_i$ and $x_i$. At
present we assume simple random sampling.

The ratio estimate of $Y$, the population total of the $y_i$, is

$$\hat{Y}_R = \frac{y}{x} X = \frac{\bar{y}}{\bar{x}} X \qquad (6.1)$$

where $y$, $x$ are the sample totals of the $y_i$ and $x_i$, respectively.
If $x_i$ is the value of $y_i$ at some previous time, the ratio method uses the
sample to estimate the relative change $Y/X$ that has occurred since that
time. The estimated relative change $y/x$ is multiplied by the known
population total $X$ on the previous occasion to provide an estimate of the
current population total. If the ratio $y_i/x_i$ is nearly the same on all
sampling units, the values of $y/x$ vary little from one sample to another, and the
ratio estimate is of high precision. In another application $x_i$ may be the
total acreage of a farm and $y_i$ the number of acres sown to some crop. The
ratio estimate will be successful in this case if all farmers devote about the
same percentage of their total acreage to this crop.
If the quantity to be estimated is $\bar{Y}$, the population mean value of $y_i$, the
ratio estimate is

$$\hat{\bar{Y}}_R = \frac{y}{x}\bar{X} = \frac{\bar{y}}{\bar{x}}\bar{X}$$


Frequently we wish to estimate a ratio rather than a total or mean, for
example, the ratio of corn acres to wheat acres, the ratio of expenditures
on labor to total expenditures, or the ratio of liquid assets to total assets.
The sample estimate is $\hat{R} = y/x$. In this case $X$ need not be known.
The use of ratio estimates for this purpose has already been discussed in
sections 2.9 and (with cluster sampling for proportions) 3.12.

Example. Table 6.1 shows the number of inhabitants (in 1000's) in each of a
simple random sample of 49 cities drawn from the population of 196 large cities
discussed in section 2.13. The problem is to estimate the total number of
inhabitants in the 196 cities in 1930. The true 1920 total, $X$, is assumed to be
known. Its value is 22,919.

The example is a suitable one for the ratio estimate. The majority of the
cities in the sample show an increase in size from 1920 to 1930 of the order
of 20%. From the sample data we have

$$y = \sum y_i = 6262, \qquad x = \sum x_i = 5054$$
TABLE 6.1
SIZES OF 49 LARGE UNITED STATES CITIES (in 1000's) IN 1920 (x_i) AND 1930 (y_i)
x_i y_i x_i y_i x_i y_i
76 80 2 50 243 291 
138 143 507 634 87 105 
67 67 179 260 30 111 
29 50 121 113 71 79 
381 464 50 64 256 288 
23 48 44 58 43 61 
37 63 77 89 25 57 
120 115 64 63 94 85 
61 69 64 77 43 50 
387 459 56 142 298 317 
93 104 40 60 36 46 
172 183 40 64 161 232 
78 106 38 52 74 93 
66 86 136 139 45 53 
60 57 116 130 36 54 
46 65 46 53 50 58 


[Figure: frequency histograms of the 200 ratio estimates and of the 200 estimates
based on the sample mean; horizontal axis: total population (millions).]

Fig. 6.1 Experimental comparison of the ratio estimate with the estimate based on the
sample mean.
Consequently the ratio estimate of the 1930 total for all 196 cities is

$$\hat{Y}_R = \frac{y}{x} X = \frac{6262}{5054}(22{,}919) = 28{,}397$$

The corresponding estimate based on the sample mean per city is

$$\hat{Y} = N\bar{y} = \frac{(196)(6262)}{49} = 25{,}048$$

The correct total in 1930 is 29,351.
Figure 6.1 shows the ratio estimate and the estimate based on the sample
mean per city for each of 200 simple random samples of size 49 drawn from
this population. A substantial improvement in precision from the ratio
method is apparent.
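
The arithmetic of this example can be checked with a few lines of code. The following is a minimal sketch in Python, not part of the original text; the function names `ratio_estimate_total` and `expansion_estimate_total` are hypothetical, and the inputs are the sample totals and the known 1920 total quoted in the example.

```python
def ratio_estimate_total(y_total, x_total, X_pop):
    """Ratio estimate of the population total: (y / x) * X."""
    return y_total / x_total * X_pop


def expansion_estimate_total(y_total, n, N):
    """Simple expansion estimate: N times the sample mean per unit."""
    return N * y_total / n


# Figures quoted in the example (populations in 1000's).
y, x = 6262, 5054   # sample totals for 1930 and 1920
X = 22919           # known 1920 total for all 196 cities
n, N = 49, 196      # sample size and population size

Y_ratio = ratio_estimate_total(y, x, X)
Y_expansion = expansion_estimate_total(y, n, N)
print(round(Y_ratio), round(Y_expansion))
```

The ratio estimate lands much closer to the correct 1930 total of 29,351 than the expansion estimate does, which is the point of the example.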


6.3 APPROXIMATE VARIANCE OF THE RATIO 
ESTIMATE 


The distribution of the ratio estimate has proved annoyingly intractable
because both $y$ and $x$ vary from sample to sample. The known theoretical
results fall short of what we would like to know for practical applications.
The principal results are stated first without proof.

The ratio estimate is consistent (this is obvious). It is biased, except for
some special types of population, although the bias is negligible in large
samples. The limiting distribution of the ratio estimate, as $n$ becomes very
large, is normal, subject to some mild restrictions on the type of population
from which we are sampling. In samples of moderate size the distribution
shows a tendency to positive skewness in the kinds of populations for
which the method is most often used. We do not possess exact formulas for
the bias and the sampling variance of the estimate but only approximations
that are valid in large samples.

These results amount to saying that there is no difficulty if the sample is
large enough so that (a) the ratio is nearly normally distributed and (b)
the large-sample formula for its variance is valid. As a working rule, the
large-sample results may be used if the sample size exceeds 30 and is also
large enough so that the coefficients of variation of $\bar{x}$ and $\bar{y}$ are both less
than 10%.


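Theorem 6.1. The ratio estimates of the population total $Y$, the
population mean $\bar{Y}$, and the population ratio $Y/X$ are, respectively,

$$\hat{Y}_R = \frac{y}{x}X, \qquad \hat{\bar{Y}}_R = \frac{y}{x}\bar{X}, \qquad \hat{R} = \frac{y}{x}$$

In a simple random sample of size $n$ ($n$ large)

$$V(\hat{Y}_R) \doteq \frac{N^2(1-f)}{n}\,\frac{\sum_{i=1}^{N}(y_i - Rx_i)^2}{N-1} \qquad (6.2)$$

$$V(\hat{\bar{Y}}_R) \doteq \frac{1-f}{n}\,\frac{\sum_{i=1}^{N}(y_i - Rx_i)^2}{N-1} \qquad (6.3)$$

$$V(\hat{R}) \doteq \frac{1-f}{n\bar{X}^2}\,\frac{\sum_{i=1}^{N}(y_i - Rx_i)^2}{N-1} \qquad (6.4)$$

where $f = n/N$ is the sampling fraction.


The argument leading to the approximate result (6.4) was given in
theorem 2.5. Since $\hat{\bar{Y}}_R = \bar{X}\hat{R}$, $\hat{Y}_R = N\bar{X}\hat{R}$, the other two results follow
immediately.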


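Corollary 1. There are various alternative forms of the result. Since
$\bar{Y} = R\bar{X}$, we may write

$$V(\hat{Y}_R) = \frac{N^2(1-f)}{n(N-1)} \sum_{i=1}^{N} [(y_i - \bar{Y}) - R(x_i - \bar{X})]^2$$

$$= \frac{N^2(1-f)}{n(N-1)} \left[\sum (y_i - \bar{Y})^2 + R^2 \sum (x_i - \bar{X})^2 - 2R \sum (y_i - \bar{Y})(x_i - \bar{X})\right]$$

The correlation coefficient $\rho$ between $y_i$ and $x_i$ in the finite population is
defined by the equation

$$\rho = \frac{\sum (y_i - \bar{Y})(x_i - \bar{X})}{\sqrt{\sum (y_i - \bar{Y})^2 \sum (x_i - \bar{X})^2}} = \frac{\sum (y_i - \bar{Y})(x_i - \bar{X})}{(N-1)S_yS_x}$$

This leads to the result

$$V(\hat{Y}_R) = \frac{N^2(1-f)}{n}(S_y^2 + R^2S_x^2 - 2R\rho S_yS_x) \qquad (6.5)$$

An equivalent form is

$$V(\hat{Y}_R) = \frac{(1-f)}{n}Y^2\left(\frac{S_y^2}{\bar{Y}^2} + \frac{S_x^2}{\bar{X}^2} - \frac{2S_{yx}}{\bar{Y}\bar{X}}\right) \qquad (6.6)$$

where $S_{yx} = \rho S_yS_x$ is the covariance between $y_i$ and $x_i$. This relation may
also be written as

$$V(\hat{Y}_R) = \frac{(1-f)}{n}Y^2(C_{yy} + C_{xx} - 2C_{yx}) \qquad (6.7)$$

where $C_{yy}$, $C_{xx}$ are the squares of the coefficients of variation (cv) of $y_i$ and
$x_i$, respectively, and $C_{yx}$ is the relative covariance.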


Corollary 2. Since $\hat{Y}_R$, $\hat{\bar{Y}}_R$, and $\hat{R}$ differ only by known multipliers, the
coefficient of variation (i.e., the standard error divided by the quantity
being estimated) is the same for all three estimates. From (6.7) the square
of this cv is

$$(cv)^2 = \frac{V(\hat{Y}_R)}{Y^2} = \frac{1-f}{n}(C_{yy} + C_{xx} - 2C_{yx}) \qquad (6.8)$$

The quantity $(cv)^2$ has been called the relative variance by Hansen et al.
(1953). Its use avoids repetition of variance formulas for related quantities
like the estimated population total and mean.
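
That the residual form (6.2), the moment form (6.5), and the relative variance (6.8) describe the same quantity can be confirmed numerically. The sketch below is an illustration only: the six-unit population and all function names are hypothetical, chosen so the script is self-contained.

```python
def population_moments(ys, xs):
    """Return N, the means, and the finite-population S_y^2, S_x^2, S_yx."""
    N = len(ys)
    y_bar = sum(ys) / N
    x_bar = sum(xs) / N
    s_y2 = sum((y - y_bar) ** 2 for y in ys) / (N - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (N - 1)
    s_yx = sum((y - y_bar) * (x - x_bar) for y, x in zip(ys, xs)) / (N - 1)
    return N, y_bar, x_bar, s_y2, s_x2, s_yx


def var_total_residual_form(ys, xs, n):
    """Formula (6.2): N^2 (1 - f)/n * sum (y_i - R x_i)^2 / (N - 1)."""
    N = len(ys)
    R = sum(ys) / sum(xs)
    resid_ss = sum((y - R * x) ** 2 for y, x in zip(ys, xs))
    return N ** 2 * (1 - n / N) / n * resid_ss / (N - 1)


def var_total_moment_form(ys, xs, n):
    """Formula (6.5), written with S_yx in place of rho * S_y * S_x."""
    N, y_bar, x_bar, s_y2, s_x2, s_yx = population_moments(ys, xs)
    R = y_bar / x_bar
    return N ** 2 * (1 - n / N) / n * (s_y2 + R ** 2 * s_x2 - 2 * R * s_yx)


def relative_variance(ys, xs, n):
    """Formula (6.8): (1 - f)/n * (C_yy + C_xx - 2 C_yx)."""
    N, y_bar, x_bar, s_y2, s_x2, s_yx = population_moments(ys, xs)
    c_yy = s_y2 / y_bar ** 2
    c_xx = s_x2 / x_bar ** 2
    c_yx = s_yx / (y_bar * x_bar)
    return (1 - n / N) / n * (c_yy + c_xx - 2 * c_yx)


# Hypothetical population, for illustration only.
xs = [10, 20, 30, 40, 50, 60]
ys = [12, 23, 31, 47, 58, 66]
n = 3

v_resid = var_total_residual_form(ys, xs, n)
v_moment = var_total_moment_form(ys, xs, n)
rel_var = relative_variance(ys, xs, n)
print(v_resid, v_moment, rel_var, v_resid / sum(ys) ** 2)
```

The first two printed values agree, and the last two agree, because the forms differ only by algebraic rearrangement and by the known multiplier $Y^2$.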


6.4 ACCURACY OF THE APPROXIMATE VARIANCE 


Sukhatme (1954) has investigated the error involved in the approximate
formula for $V(\hat{R})$. It will be recalled (theorem 2.5) that the approximate
formula was obtained by writing

$$\hat{R} - R = \frac{\bar{y} - R\bar{x}}{\bar{x}}$$

and then replacing $\bar{x}$ by $\bar{X}$ in the denominator. Instead, we write

$$\frac{1}{\bar{x}} = \frac{1}{\bar{X} + (\bar{x} - \bar{X})} = \frac{1}{\bar{X}}\left(1 + \frac{\bar{x} - \bar{X}}{\bar{X}}\right)^{-1}$$

and expand the right-hand term in parentheses by a Taylor's series. This
gives

$$\hat{R} - R = \frac{\bar{y} - R\bar{x}}{\bar{X}}\left[1 - \frac{\bar{x} - \bar{X}}{\bar{X}} + \frac{(\bar{x} - \bar{X})^2}{\bar{X}^2} - \cdots\right] \qquad (6.9)$$

By squaring this expression, we can express $E(\hat{R} - R)^2$ in terms of the
moments of the joint distribution of $\bar{y}$ and $\bar{x}$. Unfortunately, the result is
too complicated to lead to a useful guide for practical applications and
will not be given here.
If $\bar{y}$ and $\bar{x}$ follow a bivariate normal distribution, the result simplifies
considerably. Sukhatme (1954) has shown that to terms of order $1/n^2$

$$E\left(\frac{\hat{R} - R}{R}\right)^2 = V_1\left(1 + \frac{3C_{xx}}{n}\right) + \frac{6C_{xx}}{n^2}(\rho^2C_{yy} + C_{xx} - 2C_{yx}) \qquad (6.10)$$

where

$$V_1 = \frac{1}{n}(C_{yy} + C_{xx} - 2C_{yx})$$

is the first approximation to the relative variance of $\hat{R}$, as given by (6.8)
with the fpc ignored. Taking $V_1$ as a common factor in (6.10), we have

$$E\left(\frac{\hat{R} - R}{R}\right)^2 = V_1\left(1 + \frac{3C_{xx}}{n} + \frac{6C_{xx}}{n}\,\frac{\rho^2C_{yy} + C_{xx} - 2C_{yx}}{C_{yy} + C_{xx} - 2C_{yx}}\right) \qquad (6.11)$$

Since the right-hand term inside the parentheses is less than $6C_{xx}/n$, this
gives

$$E\left(\frac{\hat{R} - R}{R}\right)^2 < V_1\left(1 + \frac{9C_{xx}}{n}\right) \qquad (6.12)$$

to terms of order $1/n^2$. Now $C_{xx}/n$ is the square of the coefficient of
variation of $\bar{x}$. Thus, if $n$ is large enough so that the cv of $\bar{x}$ is less than 0.1, the use
of $V_1$ should not underestimate by more than 9%. In practice, the
multiplier 9 in (6.12) appears to be unduly high as compared with (6.11). For
instance, if $C_{yy} = C_{xx}$, (6.11) reduces to

$$E\left(\frac{\hat{R} - R}{R}\right)^2 = V_1\left[1 + \frac{C_{xx}}{n}(6 - 3\rho)\right]$$

Since $\rho$ is almost always positive in applications of the ratio method, a
multiplier between 3 and 6 is more representative. However, the effects of
non-normality in $\bar{y}$ and $\bar{x}$ also enter into the term of order $1/n^2$.

The expression $E(\hat{R} - R)^2$ is the mean square error of $\hat{R}$ about the true
ratio $R$ rather than the variance of $\hat{R}$. Since $\hat{R}$ is, in general, biased, the
mean square error is more appropriate than the variance as a measure of
its accuracy.
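
The size of the second-order correction is easy to tabulate. The sketch below evaluates the factor multiplying $V_1$ in (6.11), under the bivariate normal assumption stated above, and compares it with the bound in (6.12). The function name `second_order_factor` and the illustrative values of $n$, $C_{xx}$, and $\rho$ are assumptions made for the example, not figures from the text.

```python
from math import sqrt


def second_order_factor(n, c_yy, c_xx, rho):
    """Factor multiplying V1 in (6.11), to terms of order 1/n^2.

    c_yy and c_xx are squared coefficients of variation of y_i and x_i;
    the relative covariance is taken as rho * sqrt(c_yy * c_xx).
    """
    c_yx = rho * sqrt(c_yy * c_xx)
    first_order = c_yy + c_xx - 2 * c_yx          # n * V1
    extra = rho ** 2 * c_yy + c_xx - 2 * c_yx
    return 1 + 3 * c_xx / n + (6 * c_xx / n) * extra / first_order


# Illustrative values only: equal cv's for y and x, cv of the mean of x about 0.07.
n, c = 50, 0.25
bound = 1 + 9 * c / n                              # factor in (6.12)
for rho in (0.5, 0.7, 0.9):
    factor = second_order_factor(n, c, c, rho)
    special_case = 1 + (c / n) * (6 - 3 * rho)     # form when C_yy = C_xx
    print(rho, factor, special_case, bound)
```

With equal cv's the computed factor coincides with $1 + (C_{xx}/n)(6 - 3\rho)$ and stays below the bound, in line with the remark that a multiplier between 3 and 6 is more representative than 9.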
6.5 BIAS OF THE RATIO ESTIMATE


The ratio estimate has a bias of order $1/n$. Since the s.e. of the estimate
is of order $1/\sqrt{n}$, the ratio of the bias to the s.e. is also of order $1/\sqrt{n}$ and
becomes negligible as $n$ becomes large. In practice, the bias is usually
found to be unimportant even in samples of moderate size. Three useful
results about the bias are presented.

The first gives the leading term in the bias when it is expanded in a
Taylor's series. From (6.9), retaining the first two terms, we have

$$\hat{R} - R \doteq \frac{\bar{y} - R\bar{x}}{\bar{X}}\left(1 - \frac{\bar{x} - \bar{X}}{\bar{X}}\right)$$

Now

$$E(\bar{y} - R\bar{x}) = \bar{Y} - R\bar{X} = 0$$

so that the leading term in the bias comes from the second term inside the
brackets. Further,

$$E\,\bar{y}(\bar{x} - \bar{X}) = E(\bar{y} - \bar{Y})(\bar{x} - \bar{X}) = \frac{1-f}{n}\rho S_yS_x$$

by theorem 2.3 (p. 24) and the definition of $\rho$. Also,

$$E\,\bar{x}(\bar{x} - \bar{X}) = E(\bar{x} - \bar{X})^2 = \frac{1-f}{n}S_x^2$$

Hence the leading term in the bias is

$$E(\hat{R} - R) \doteq \frac{1-f}{n\bar{X}^2}(RS_x^2 - \rho S_yS_x) \qquad (6.13)$$

The relative bias (i.e., bias/$R$), which is the same for $\hat{R}$, $\hat{Y}_R$, and $\hat{\bar{Y}}_R$, is

$$\frac{E(\hat{R} - R)}{R} \doteq \frac{1-f}{n\bar{X}\bar{Y}}(RS_x^2 - \rho S_yS_x) = \frac{1-f}{n}(C_{xx} - C_{yx}) \qquad (6.14)$$

Since sample estimates of $R$, $S_x$, $S_y$, $\rho$, $\bar{X}$, and $\bar{Y}$ can be computed, (6.13)
and (6.14) are sometimes used as a rough check on the size of the bias in a
specific sample.
A second result is that the ratio estimate is unbiased if the regression of $y$
on $x$ is a straight line through the origin. This means that $E(y \mid x) = \beta x$.
In a finite population this relation implies that (a) if several units have
exactly the same value of $x$, the mean of their $y$-values is $\beta x$, and (b) if a
specific value of $x$ occurs on only one unit in the population the value of
$y$ for that unit is $\beta x$. These relations are unlikely to be satisfied exactly in a
finite population. However, they will frequently be satisfied approximately,
since, as mentioned in section 6.2, the ratio estimate is likely to be used
when there is reason to think that $y/x$ is approximately constant.
Theorem 6.2. If $E(y \mid x) = \beta x$ for all values of $x$ in a finite population,
the estimates $\hat{Y}_R$, $\hat{\bar{Y}}_R$, and $\hat{R}$ are unbiased in simple random samples of
size $n$.
Proof. Write

$$y = \beta x + e$$

Then

$$E(e \mid x) = 0 \qquad (6.15)$$

for any value of $x$. Averaging over the population, we have

$$\bar{Y} = \beta\bar{X}$$

so that $R = \beta$. Further, averaging over a sample,

$$\bar{y} = \beta\bar{x} + \bar{e}$$

that is,

$$\hat{R} = R + \frac{\bar{e}}{\bar{x}}$$

Take the average of $\hat{R}$ over all samples of size $n$ that contain the same set of
values of $x$, so that $\bar{x}$ remains constant during this averaging. Suppose
that a particular value of $x$, say $x'$, occurs in $m'$ units in the sample and in
$M'$ units in the population. Then in this averaging each of the $M'$ units
appears equally often in the sample. But by (6.15)

$$E(e \mid x') = 0$$

Hence $E(\bar{e}) = 0$ over this set of samples. It follows that $E(\hat{R}) = R$ over
this set of samples and that $E(\hat{R}) = R$ over all simple random samples of
size $n$.

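The third result, due to Hartley and Ross (1954), gives an upper bound
to the ratio of the bias to the standard error. Consider the covariance, in
simple random samples of size $n$, of the quantities $\hat{R}$ and $\bar{x}$. We have

$$\text{cov}(\hat{R}, \bar{x}) = E\left(\frac{\bar{y}}{\bar{x}} \cdot \bar{x}\right) - E(\hat{R})E(\bar{x}) = \bar{Y} - \bar{X}E(\hat{R})$$

Hence

$$E(\hat{R}) = \frac{\bar{Y}}{\bar{X}} - \frac{1}{\bar{X}}\text{cov}(\hat{R}, \bar{x}) = R - \frac{1}{\bar{X}}\text{cov}(\hat{R}, \bar{x}) \qquad (6.16)$$

Thus the bias in $\hat{R}$ is $-\text{cov}(\hat{R}, \bar{x})/\bar{X}$. Unlike the Taylor approximation
(6.13) to the bias, this expression is exact.
Further,

$$|\text{bias in } \hat{R}| = \frac{|\rho_{\hat{R},\bar{x}}|\,\sigma_{\hat{R}}\,\sigma_{\bar{x}}}{\bar{X}} \le \frac{\sigma_{\hat{R}}\,\sigma_{\bar{x}}}{\bar{X}}$$

since $\hat{R}$ and $\bar{x}$ cannot have a correlation $> 1$. Hence

$$\frac{|\text{bias in } \hat{R}|}{\sigma_{\hat{R}}} \le \frac{\sigma_{\bar{x}}}{\bar{X}} = \text{cv of } \bar{x} \qquad (6.17)$$

The same bound applies, of course, to the bias in $\hat{Y}_R$ and $\hat{\bar{Y}}_R$. Thus if the
cv of $\bar{x}$ is less than 0.1, the bias may safely be regarded as negligible in
relation to the s.e.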
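
Because the Hartley and Ross expression is exact, it can be verified by listing every possible sample from a small population. The following sketch does this in Python; the six-unit population is hypothetical and was chosen only so that the enumeration is short.

```python
from itertools import combinations
from math import sqrt

# Hypothetical population of N = 6 units (values chosen only for illustration).
xs = [1, 3, 4, 7, 9, 12]
ys = [2, 5, 9, 13, 20, 23]
n = 3

N = len(xs)
X_bar = sum(xs) / N
R = sum(ys) / sum(xs)

# Enumerate every simple random sample of size n; all are equally likely.
r_hats, x_bars = [], []
for sample in combinations(range(N), n):
    y_tot = sum(ys[i] for i in sample)
    x_tot = sum(xs[i] for i in sample)
    r_hats.append(y_tot / x_tot)
    x_bars.append(x_tot / n)

m = len(r_hats)
mean_r = sum(r_hats) / m
mean_x = sum(x_bars) / m
cov_rx = sum(r * x for r, x in zip(r_hats, x_bars)) / m - mean_r * mean_x
sd_r = sqrt(sum((r - mean_r) ** 2 for r in r_hats) / m)
sd_x = sqrt(sum((x - mean_x) ** 2 for x in x_bars) / m)

bias = mean_r - R
print("exact bias of the ratio:   ", bias)
print("-cov(ratio, x-mean)/X-mean:", -cov_rx / X_bar)
print("|bias| / s.e.:             ", abs(bias) / sd_r)
print("cv of the sample mean of x:", sd_x / X_bar)
```

The first two printed values coincide, as (6.16) requires, and the third does not exceed the fourth, as (6.17) requires.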
6.6 ESTIMATION OF THE VARIANCE FROM A SAMPLE


From equation (6.2)

$$V(\hat{Y}_R) \doteq \frac{N^2(1-f)}{n}\,\frac{\sum_{i=1}^{N}(y_i - Rx_i)^2}{N-1}$$

As already mentioned in section 2.9, we take

$$\frac{\sum_{i=1}^{n}(y_i - \hat{R}x_i)^2}{n-1}$$

as a sample estimate of the population variance. This estimate has a bias
of order $1/n$.
For the estimated variance, $v(\hat{Y}_R)$, this gives

$$v(\hat{Y}_R) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n}(y_i - \hat{R}x_i)^2 = \frac{N(N-n)}{n(n-1)}\left(\sum y_i^2 + \hat{R}^2\sum x_i^2 - 2\hat{R}\sum y_ix_i\right) \qquad (6.18)$$

this being the form that is speediest to compute.


Example. This illustrates the calculation of the standard error of a ratio
estimate of a population total. The data in Table 6.1 (p. 156) will be used. First
calculate

$$\hat{R} = \frac{y}{x} = \frac{6262}{5054} = 1.239019$$

From (6.18),

$$v(\hat{Y}_R) = \frac{N(N-n)}{n(n-1)}\left(\sum y_i^2 + \hat{R}^2\sum x_i^2 - 2\hat{R}\sum y_ix_i\right)$$

To compute the quadratic term the sums of squares and products are placed on
the same row as their multipliers:

                              Multiplier
Σ y_i^2   = 1,527,882          1
Σ x_i^2   = 1,044,504          1.535168 = R̂^2
Σ y_i x_i = 1,251,630         -2.478038 = -2R̂

Hence

$$v(\hat{Y}_R) = \frac{(196)(147)}{(49)(48)}(29{,}784) = 364{,}854$$

$$s(\hat{Y}_R) = 604$$
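
The computation in (6.18) needs only five sample sums, which makes it easy to script. The sketch below is an illustration rather than part of the original text; the function name `estimated_variance_ratio_total` is hypothetical, and the inputs are the sums quoted in the example.

```python
from math import sqrt


def estimated_variance_ratio_total(N, n, sum_y, sum_x, sum_y2, sum_x2, sum_yx):
    """Sample estimate (6.18) of the variance of the ratio estimate of a total."""
    r_hat = sum_y / sum_x
    quadratic = sum_y2 + r_hat ** 2 * sum_x2 - 2 * r_hat * sum_yx
    return N * (N - n) / (n * (n - 1)) * quadratic


# Sums quoted in the example for the 49 sampled cities.
v = estimated_variance_ratio_total(
    N=196, n=49,
    sum_y=6262, sum_x=5054,
    sum_y2=1_527_882, sum_x2=1_044_504, sum_yx=1_251_630,
)
se = sqrt(v)
print(v, se)
```

Carrying the ratio at full precision changes the variance only in its last digits relative to the hand calculation, which rounds the quadratic term to 29,784; the standard error still rounds to 604.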
6.7 CONFIDENCE LIMITS


If the sample is large enough so that the normal approximation applies,
confidence limits for $Y$ and $R$ may be obtained:

$$Y:\quad \hat{Y}_R \pm t\sqrt{v(\hat{Y}_R)} \qquad (6.19)$$

$$R:\quad \hat{R} \pm t\sqrt{v(\hat{R})} \qquad (6.20)$$

where $t$ is the normal deviate corresponding to the chosen confidence
probability.

In section 6.3 it was suggested that the normal approximation applies
reasonably well if the sample size is at least 30 and is large enough so that
the cv's of $\bar{y}$ and $\bar{x}$ are both less than 0.1. When these conditions do not
apply, the formula for $v(\hat{R})$ tends to give values that are too low and the
positive skewness in the distribution of $\hat{R}$ may become noticeable.

An alternative method of computing confidence limits has been used in
biological assay (Fieller, 1932; Paulson, 1942). This approach requires
fewer assumptions than the normal approximation and takes some account
of the skewness of the distribution of $\hat{R}$.

The method requires that $\bar{y}$ and $\bar{x}$ follow a bivariate normal distribution,
so that $(\bar{y} - R\bar{x})$ is normally distributed. It follows that the quantity

$$\frac{\bar{y} - R\bar{x}}{\sqrt{(N - n)/Nn}\,\sqrt{s_y^2 + R^2s_x^2 - 2Rs_{yx}}} \qquad (6.21)$$

is approximately normally distributed with mean zero and unit standard
deviation. (We have substituted sample estimates $s_y^2$, etc., for the
corresponding population variances and covariance and are assuming the sample
size large enough so that this introduces negligible error. In biological assay, in
which samples may be quite small, the quantity above would be regarded as
following Student's t-distribution.)

The value of $R$ is unknown, but any contemplated value of $R$ which
makes this normal deviate large enough may be regarded as rejected by the
sample data. Consequently, confidence limits for $R$ are found by setting
(6.21) equal to $\pm t$ and solving the resulting quadratic equation for $R$. The
confidence limits are approximate, for, if we try to check them by sampling
repeatedly from a fixed population with known $R$, some values of $\bar{y}$ and $\bar{x}$
turn up for which the two roots of the quadratic are imaginary. Such cases
become rare if the cv's of $\bar{y}$ and $\bar{x}$ are less than 0.3.

After some manipulation, the two roots may be expressed as

$$R = \hat{R}\,\frac{(1 - t^2c_{\bar{y}\bar{x}}) \pm t\sqrt{(c_{\bar{y}\bar{y}} + c_{\bar{x}\bar{x}} - 2c_{\bar{y}\bar{x}}) - t^2(c_{\bar{y}\bar{y}}c_{\bar{x}\bar{x}} - c_{\bar{y}\bar{x}}^2)}}{1 - t^2c_{\bar{x}\bar{x}}} \qquad (6.22)$$

where

$$c_{\bar{y}\bar{y}} = \frac{N - n}{Nn}\,\frac{s_y^2}{\bar{y}^2}$$

is the square of the estimated cv of $\bar{y}$, with analogous definitions of $c_{\bar{x}\bar{x}}$
and $c_{\bar{y}\bar{x}}$. If $t^2c_{\bar{y}\bar{y}}$, $t^2c_{\bar{x}\bar{x}}$, and $t^2c_{\bar{y}\bar{x}}$ are all small relative to 1, the limits
reduce to

$$R = \hat{R} \pm t\hat{R}\sqrt{c_{\bar{y}\bar{y}} + c_{\bar{x}\bar{x}} - 2c_{\bar{y}\bar{x}}}$$

This expression is the same as the normal approximation (6.20).
Quadratic limits for $Y$ are found by replacing $\hat{R}$ in equation (6.22) by $\hat{Y}_R$.
Although the quadratic limits should in general be more accurate than
the normal limits from (6.20), since fewer assumptions are required,
Hajek (1958) has shown that if the regression of $y$ on $x$ goes through the
origin the normal limits contain $R$, in large samples, with higher frequency
than the quadratic limits.
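
The quadratic limits are straightforward to compute once the three squared cv's are in hand. The sketch below is one possible implementation, offered as an illustration: the function names `quadratic_limits` and `normal_limits` are hypothetical, the deviate 1.96 is an assumed choice of confidence probability, and the inputs are derived from the sample sums quoted for Table 6.1 in section 6.6.

```python
from math import sqrt


def _squared_cvs(y_bar, x_bar, s_y2, s_x2, s_yx, n, N):
    """Squared estimated cv's of the sample means, and their relative covariance."""
    k = (N - n) / (N * n)
    return k * s_y2 / y_bar ** 2, k * s_x2 / x_bar ** 2, k * s_yx / (y_bar * x_bar)


def quadratic_limits(y_bar, x_bar, s_y2, s_x2, s_yx, n, N, t=1.96):
    """Roots of the quadratic obtained by setting (6.21) equal to +/- t."""
    c_yy, c_xx, c_yx = _squared_cvs(y_bar, x_bar, s_y2, s_x2, s_yx, n, N)
    disc = (c_yy + c_xx - 2 * c_yx) - t ** 2 * (c_yy * c_xx - c_yx ** 2)
    denom = 1 - t ** 2 * c_xx
    if disc < 0 or denom <= 0:
        raise ValueError("the quadratic does not give a finite interval")
    r_hat = y_bar / x_bar
    centre = 1 - t ** 2 * c_yx
    half = t * sqrt(disc)
    return r_hat * (centre - half) / denom, r_hat * (centre + half) / denom


def normal_limits(y_bar, x_bar, s_y2, s_x2, s_yx, n, N, t=1.96):
    """Large-sample limits: R-hat +/- t * R-hat * sqrt(c_yy + c_xx - 2 c_yx)."""
    c_yy, c_xx, c_yx = _squared_cvs(y_bar, x_bar, s_y2, s_x2, s_yx, n, N)
    r_hat = y_bar / x_bar
    half = t * r_hat * sqrt(c_yy + c_xx - 2 * c_yx)
    return r_hat - half, r_hat + half


# Sample quantities derived from the sums quoted for Table 6.1.
n, N, t = 49, 196, 1.96
sum_y, sum_x = 6262, 5054
y_bar, x_bar = sum_y / n, sum_x / n
s_y2 = (1_527_882 - sum_y ** 2 / n) / (n - 1)
s_x2 = (1_044_504 - sum_x ** 2 / n) / (n - 1)
s_yx = (1_251_630 - sum_y * sum_x / n) / (n - 1)

lo, hi = quadratic_limits(y_bar, x_bar, s_y2, s_x2, s_yx, n, N, t)
lo_n, hi_n = normal_limits(y_bar, x_bar, s_y2, s_x2, s_yx, n, N, t)
print((lo, hi), (lo_n, hi_n))
```

The quadratic interval is not symmetric about the sample ratio, which is how it takes some account of the skewness; the guard clause covers the case of imaginary roots mentioned in the text.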


6.8 COMPARISON OF THE RATIO ESTIMATE WITH 
THE MEAN PER UNIT 


The type of estimate of $Y$ which was studied in preceding chapters is $N\bar{y}$,
where $\bar{y}$ is the mean per unit for the sample (in simple random sampling)
or a weighted mean per unit (in stratified random sampling). Estimates of
this kind are called estimates based on the mean per unit or estimates
obtained by simple expansion.

Theorem 6.3. In large samples, with simple random sampling, the ratio
estimate $\hat{Y}_R$ has a smaller variance than the estimate $\hat{Y} = N\bar{y}$ obtained by
simple expansion, if

$$\rho > \frac{1}{2}\left(\frac{S_x}{\bar{X}}\right)\Big/\left(\frac{S_y}{\bar{Y}}\right) = \frac{\text{coefficient of variation of } x_i}{2(\text{coefficient of variation of } y_i)}$$

Proof. For $\hat{Y}$ we have

$$V(\hat{Y}) = \frac{N^2(1-f)}{n}S_y^2$$

For the ratio estimate we have from (6.5)

$$V(\hat{Y}_R) = \frac{N^2(1-f)}{n}(S_y^2 + R^2S_x^2 - 2R\rho S_yS_x)$$

Hence the ratio estimate has the smaller variance if

$$S_y^2 + R^2S_x^2 - 2R\rho S_yS_x < S_y^2$$

i.e., if

$$\rho > \frac{RS_x}{2S_y} = \frac{1}{2}\left(\frac{S_x}{\bar{X}}\right)\Big/\left(\frac{S_y}{\bar{Y}}\right)$$
This theorem shows that the ratio estimate may be either more or less
precise than a simple expansion. The issue depends on the size of the
correlation coefficient between $y_i$ and $x_i$ and on the cv's of these two variates.
The variability of the auxiliary variate $x_i$ is an important factor: if its cv is
more than twice that of $y_i$, the ratio estimate is always less precise, since $\rho$
cannot exceed 1. When $x_i$ is the value of $y_i$ at some previous time, the two
cv's may be about equal. In this event the ratio estimate is superior if $\rho$
exceeds 0.5.

Theorem 6.3 applies only for samples large enough so that the
approximate formula for $V(\hat{Y}_R)$ is valid. In smaller samples the ratio method
probably does not compare so favorably as the theorem suggests, since the
approximate formula is usually an underestimate.
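
The condition in theorem 6.3 can be turned into a quick planning check. The sketch below is illustrative only: the function names are hypothetical, and the variance ratio is obtained by dividing (6.7) by the corresponding expression for the simple expansion, which gives $1 + k^2 - 2\rho k$ with $k$ the ratio of the cv of $x_i$ to the cv of $y_i$.

```python
def ratio_estimate_is_more_precise(rho, cv_x, cv_y):
    """Large-sample condition of theorem 6.3: rho > cv_x / (2 * cv_y)."""
    return rho > cv_x / (2 * cv_y)


def variance_ratio(rho, cv_x, cv_y):
    """Approximate V(ratio estimate) / V(simple expansion) in large samples."""
    k = cv_x / cv_y
    return 1 + k ** 2 - 2 * rho * k


# Equal cv's: the break-even correlation is 0.5.
print(variance_ratio(0.5, 1.0, 1.0))    # equal variances
print(variance_ratio(0.9, 1.0, 1.0))    # ratio estimate much more precise
# cv of x more than twice the cv of y: the ratio estimate loses even at rho = 1.
print(variance_ratio(1.0, 2.5, 1.0))
```

A variance ratio below 1 corresponds exactly to the theorem's inequality, so the two functions always agree on which estimate is preferred.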


6.9 CONDITIONS UNDER WHICH THE RATIO ESTIMATE
IS OPTIMUM


A well-known result in the theory of regression indicates the type of
population for which the ratio estimate is the best among a wide class of
estimates. The theorem applies to infinite populations.


Theorem 6.4. With simple random sampling from an infinite population,
the ratio estimate of $\bar{Y}$ is a "best linear unbiased estimate" if two
conditions are satisfied:


1. The relation between $y_i$ and $x_i$ is a straight line through the origin.
2. The variance of $y_i$ about this line is proportional to $x_i$.


A "best linear unbiased estimate" is defined as follows. Consider all
estimates that are linear functions of the sample values $y_i$, that is, that are
of the form

$$l_1y_1 + l_2y_2 + \cdots + l_ny_n$$

where the $l$'s do not depend on the $y_i$, although they may be functions of
the $x_i$. The choice of the $l$'s is restricted to those that give unbiased
estimates of $\bar{Y}$. The estimate that has the smallest variance is called the "best
linear unbiased estimate."


Proof. The mathematical model is

$$y_i = \beta x_i + e_i$$

where the $e_i$ are independent of the $x_i$. In arrays in which $x_i$ is fixed $e_i$ has
mean zero and variance $\lambda x_i$. Hence

$$\bar{Y} = \beta\bar{X}$$

It was shown by Gauss that the best linear unbiased estimate of $\beta\bar{X}$ is
$b\bar{X}$, where $b$ is the least squares estimate of $\beta$ (see, e.g., David and Neyman,
1938). The least squares estimate is

$$b = \frac{\sum w_iy_ix_i}{\sum w_ix_i^2}, \qquad \text{where } w_i = \frac{1}{\lambda x_i}$$

This gives

$$b = \frac{\sum y_i}{\sum x_i} = \frac{\bar{y}}{\bar{x}}$$

Consequently, the optimum estimate of $\bar{Y}$ is the ratio estimate $(\bar{y}/\bar{x})\bar{X}$.


The practical relevance of this result is that it suggests the conditions
under which the ratio estimate is not only superior to the mean per unit but
is the best of a whole class of estimates. When we are trying to decide what
kind of estimate to use, a graph in which $y_i$ is plotted against $x_i$ is helpful.
If this graph shows that the relation is a straight line through the origin and
if the variance of the points $y_i$ about the line seems to increase
proportionally to $x_i$, the ratio estimate will be hard to beat.

Sometimes the relation between $y_i$ and $x_i$ is a straight line through the
origin, but the variance of $y_i$ in arrays in which $x_i$ is fixed is not proportional
to $x_i$. In a population sample of Greece, Jessen et al. (1947) found that the
variance increased roughly as $x_i^2$. This suggests a weighted regression in
which $w_i \propto 1/x_i^2$. For the least squares estimate $b$, this gives

$$b = \frac{\sum w_iy_ix_i}{\sum w_ix_i^2} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i}{x_i}\right)$$

In this situation the best estimate of $\bar{Y}$ is $b\bar{X}$, where $b$ is the mean of the
ratios $y_i/x_i$ on the individual sampling units.
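
Both special cases of the weighted least squares estimate can be reproduced with one small function. The sketch below is an illustration; the function name `wls_slope_through_origin` and the five data pairs are hypothetical.

```python
def wls_slope_through_origin(ys, xs, weights):
    """Weighted least squares slope for a line through the origin."""
    num = sum(w * y * x for w, y, x in zip(weights, ys, xs))
    den = sum(w * x * x for w, x in zip(weights, xs))
    return num / den


# Hypothetical sample values, for illustration only.
xs = [4.0, 7.0, 10.0, 15.0, 24.0]
ys = [5.0, 8.0, 13.0, 17.0, 31.0]

# Variance proportional to x: weights 1/x give the ratio of the totals.
b_ratio = wls_slope_through_origin(ys, xs, [1 / x for x in xs])

# Variance proportional to x^2: weights 1/x^2 give the mean of the ratios.
b_mean_of_ratios = wls_slope_through_origin(ys, xs, [1 / x ** 2 for x in xs])

print(b_ratio, sum(ys) / sum(xs))
print(b_mean_of_ratios, sum(y / x for y, x in zip(ys, xs)) / len(xs))
```

Any constant of proportionality in the weights, such as $1/\lambda$, cancels between numerator and denominator, so it is omitted here.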


6.10 RATIO ESTIMATES IN STRATIFIED RANDOM
SAMPLING


There are two ways in which a ratio estimate of the population total $Y$
can be made. One is to make a separate ratio estimate of the total of each
stratum and add these totals. If $y_h$, $x_h$ are the sample totals in the $h$th
stratum and $X_h$ is the stratum total of the $x_{hi}$, this estimate $\hat{Y}_{Rs}$ ($s$ for
separate) is

$$\hat{Y}_{Rs} = \sum_h \frac{y_h}{x_h}X_h = \sum_h \frac{\bar{y}_h}{\bar{x}_h}X_h \qquad (6.23)$$

No assumption is made that the true ratio remains constant from stratum
to stratum. The estimate requires a knowledge of the separate totals $X_h$.
Theorem 6.5. If the sample sizes $n_h$ are large in all strata,

$$V(\hat{Y}_{Rs}) = \sum_h \frac{N_h^2(1 - f_h)}{n_h}(S_{yh}^2 + R_h^2S_{xh}^2 - 2R_h\rho_hS_{yh}S_{xh}) \qquad (6.24)$$

where $R_h = Y_h/X_h$ is the true ratio in stratum $h$, and $\rho_h$ is defined as before
in each stratum.
Proof. Write

$$\hat{Y}_{Rh} = \frac{y_h}{x_h}X_h$$

Then

$$\hat{Y}_{Rs} - Y = \sum_h (\hat{Y}_{Rh} - Y_h)$$

Hence

$$V(\hat{Y}_{Rs}) = E(\hat{Y}_{Rs} - Y)^2 = \sum_h E(\hat{Y}_{Rh} - Y_h)^2 + 2\sum_h\sum_{j>h} E(\hat{Y}_{Rh} - Y_h)(\hat{Y}_{Rj} - Y_j)$$

Since $\hat{Y}_{Rh}$ is the ratio estimate made from a simple random sample within
stratum $h$, we may use (6.5) for the approximate variance of $\hat{Y}_{Rh}$, that is,

$$V(\hat{Y}_{Rh}) = \frac{N_h^2(1 - f_h)}{n_h}(S_{yh}^2 + R_h^2S_{xh}^2 - 2R_h\rho_hS_{yh}S_{xh})$$

The cross-product terms vanish because the sampling is independent
in the different strata and, to the order of approximation used in the
variance formula, $\hat{Y}_{Rh}$ is an unbiased estimate of $Y_h$. Result (6.24) follows.


This formula is valid only if the sample in each stratum is large enough
so that the approximate variance formula applies to each stratum. This
limitation should be noted in practical applications.


Moreover, when the $n_h$ are small and the number of strata $L$ is large, the
bias in $\hat{Y}_{Rs}$ may not be negligible in relation to its standard error, as the
following crude argument suggests.

In a single stratum we have seen (section 6.5) that

$$\frac{|\text{bias in } \hat{Y}_{Rh}|}{\sigma(\hat{Y}_{Rh})} \le \text{cv of } \bar{x}_h$$

If the bias has the same sign in all strata, as may happen, the bias in
$\hat{Y}_{Rs}$ will be roughly $L$ times that in $\hat{Y}_{Rh}$. But the standard error of $\hat{Y}_{Rs}$ is
only of the order of $\sqrt{L}$ times that of $\hat{Y}_{Rh}$. Hence the ratio

$$\frac{|\text{bias in } \hat{Y}_{Rs}|}{\sigma(\hat{Y}_{Rs})}$$

is of order

$$\sqrt{L}\,(\text{cv of } \bar{x}_h)$$

For example, with 50 strata and the cv of $\bar{x}_h$ about 0.1 in each stratum,
the bias in $\hat{Y}_{Rs}$ might be as large as 0.7 times its standard error. The
contribution of the bias to the mean square error of $\hat{Y}_{Rs}$ would then be
about one third.
Although in practice the bias is usually much smaller than its upper
bound, the danger of bias with the separate ratio estimate should be kept
in mind if $\sqrt{L}\,(\text{cv of } \bar{x}_h)$ exceeds say 0.3.


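6.11 THE COMBINED RATIO ESTIMATE


An alternative estimate is derived from a single combined ratio (Hansen,
Hurwitz, and Gurney, 1946). From the sample data we compute

$$\hat{Y}_{st} = \sum_h N_h\bar{y}_h, \qquad \hat{X}_{st} = \sum_h N_h\bar{x}_h$$

These are the standard estimates of the population totals $Y$ and $X$,
respectively, made from a stratified sample. The combined ratio estimate, $\hat{Y}_{Rc}$
($c$ for combined) is

$$\hat{Y}_{Rc} = \frac{\hat{Y}_{st}}{\hat{X}_{st}}X = \frac{\bar{y}_{st}}{\bar{x}_{st}}X$$

where $\bar{y}_{st} = \hat{Y}_{st}/N$, $\bar{x}_{st} = \hat{X}_{st}/N$ are the estimated population means from
a stratified sample.
The estimate $\hat{Y}_{Rc}$ does not require a knowledge of the $X_h$, but only of $X$.

The combined estimate is much less subject to the risk of bias than the
separate estimate. Using the approach of Hartley and Ross in section
6.5, we have, writing $\hat{R}_c = \bar{y}_{st}/\bar{x}_{st}$,

$$\text{cov}(\hat{R}_c, \bar{x}_{st}) = E\left(\frac{\bar{y}_{st}}{\bar{x}_{st}} \cdot \bar{x}_{st}\right) - E(\hat{R}_c)E(\bar{x}_{st}) = \bar{Y} - \bar{X}E(\hat{R}_c)$$

Hence

$$E(\hat{R}_c) = R - \frac{1}{\bar{X}}\text{cov}(\hat{R}_c, \bar{x}_{st})$$

and

$$\frac{|\text{bias in } \hat{R}_c|}{\sigma_{\hat{R}_c}} = \frac{|\rho_{\hat{R}_c,\bar{x}_{st}}|\,\sigma_{\bar{x}_{st}}}{\bar{X}} \le \text{cv of } \bar{x}_{st}$$

Thus the biases in $\hat{R}_c$, $\hat{Y}_{Rc}$ are negligible relative to their standard errors,
provided only that the cv of $\bar{x}_{st}$ is less than 0.1.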
Theorem 6.6. If the total sample size $n$ is large,

$$V(\hat{Y}_{Rc}) = \sum_h \frac{N_h^2(1 - f_h)}{n_h}(S_{yh}^2 + R^2S_{xh}^2 - 2R\rho_hS_{yh}S_{xh}) \qquad (6.25)$$

Proof. This follows the same argument as theorem 2.5. In the present
case the key equation is

$$\hat{Y}_{Rc} - Y = \frac{N\bar{X}}{\bar{x}_{st}}(\bar{y}_{st} - R\bar{x}_{st}) \doteq N(\bar{y}_{st} - R\bar{x}_{st}) \qquad (6.26)$$

Now consider the variate $u_{hi} = y_{hi} - Rx_{hi}$. The right side of (6.26) is
$N\bar{u}_{st}$, where $\bar{u}_{st}$ is the weighted mean of the variate $u_{hi}$ in a stratified sample.
Further, the population mean $\bar{U}$ of $u_{hi}$ is zero, since $R = Y/X$.

Hence we may apply to $\bar{u}_{st}$ theorem 5.3 for the variance of the estimated
mean from a stratified random sample. This gives

$$V(\hat{Y}_{Rc}) \doteq N^2V(\bar{u}_{st}) = \sum_h \frac{N_h^2(1 - f_h)}{n_h}S_{uh}^2$$

where

$$S_{uh}^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h}(u_{hi} - \bar{U}_h)^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h}[(y_{hi} - \bar{Y}_h) - R(x_{hi} - \bar{X}_h)]^2$$

When the quadratic is expanded, result (6.25) is obtained.

From equations (6.24) and (6.25) it is interesting to note that the approximate
variances of $\hat{Y}_{Rs}$ and $\hat{Y}_{Rc}$ assume the same general form, the difference
being that the population ratios $R_h$ in the individual strata in (6.24)
are all replaced by $R$ in (6.25).

We may write

$$V(\hat{Y}_{Rc}) - V(\hat{Y}_{Rs}) = \sum_h \frac{N_h^2(1 - f_h)}{n_h}[(R^2 - R_h^2)S_{xh}^2 - 2(R - R_h)\rho_hS_{yh}S_{xh}]$$

$$= \sum_h \frac{N_h^2(1 - f_h)}{n_h}[(R - R_h)^2S_{xh}^2 + 2(R_h - R)(\rho_hS_{yh}S_{xh} - R_hS_{xh}^2)]$$

In situations in which the ratio estimate is appropriate the last term on
the right is usually small. (It vanishes if within each stratum the relation
between $y_{hi}$ and $x_{hi}$ is a straight line through the origin.) Thus, unless $R_h$
is constant from stratum to stratum, the use of a separate ratio estimate in
each stratum is likely to be more precise. This discussion assumes, however,
that the sample in each stratum is large enough so that the approximate
formula for $V(\hat{Y}_{Rs})$ is valid. With only a small sample in each
stratum, the combined estimate is to be recommended unless there is good
empirical evidence to the contrary.

For sample estimates of these variances we substitute sample estimates
of $R_h$ and $R$ in the appropriate places. The sample mean squares $s_{yh}^2$
and $s_{xh}^2$ are substituted for the corresponding variances and the sample
covariance for the term $\rho_hS_{yh}S_{xh}$. The sample mean square and covariance
must be calculated separately for each stratum.


Example. The data come from a census of all farms in Jefferson County,
Iowa. In this example $y_{hi}$ represents acres in corn and $x_{hi}$ acres in the farm. The
population is divided into two strata, the first stratum containing farms of as
many as 160 acres. We assume a sample of 100 farms. When stratified sampling
is used, we shall suppose that 70 farms are taken from stratum 1 and 30 from
stratum 2, this being roughly the optimum allocation. The data are given in
Table 6.2. The last three quantities, $Q_h$, $V_h'$, and $V_h''$, are auxiliary quantities
to be used in the computations, the last two being defined later.


TABLE 6.2
DATA FROM JEFFERSON COUNTY, IOWA

                     Size
Strata               (farm acres)      N_h    S_yh^2   S_yxh   S_xh^2   R_h
1                    0-160             1580    312      494     2055    0.2350
2                    More than 160      430    922      858     7357    0.2109
For complete pop.                      2010    620     1453     7619    0.2242

Strata               Ybar_h   Xbar_h    n_h   Q_h = W_h^2/n_h   V_h'   V_h''
1                    19.40     82.56     70    0.008828          193    194
2                    51.63    244.85     30    0.001525          887    907
For complete pop.    26.30    117.28    100


We consider five methods of estimating the population mean corn acres per
farm. The fpc are ignored.


1. Simple random sample: mean per farm estimate.

$$V_1 = \frac{S_y^2}{n} = \frac{620}{100} = 6.20$$

2. Simple random sample: ratio estimate.

$$V_2 = \frac{1}{n}(S_y^2 + R^2S_x^2 - 2RS_{yx}) = \frac{1}{100}[620 + (0.2242)^2(7619) - 2(0.2242)(1453)] = 3.51$$

3. Stratified random sample: mean per farm estimate.

$$V_3 = \sum \frac{W_h^2}{n_h}S_{yh}^2 = \sum Q_hS_{yh}^2 = 4.16$$

4. Stratified random sample: ratio estimate using a separate ratio in each
stratum.

$$V_4 = \sum Q_h(S_{yh}^2 + R_h^2S_{xh}^2 - 2R_hS_{yxh}) = \sum Q_hV_h' = 3.06$$

5. Stratified random sampling: ratio estimate using a combined ratio.

$$V_5 = \sum Q_h(S_{yh}^2 + R^2S_{xh}^2 - 2RS_{yxh}) = \sum Q_hV_h'' = 3.10$$

The relative precisions of the various methods can be summarized as
follows:

Sampling method          Method of estimation   Relative precision
1. Simple random         Mean per farm          100
2. Simple random         Ratio                  177
3. Stratified random     Mean per farm          149
4. Stratified random     Separate ratio         203
5. Stratified random     Combined ratio         200
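
The five variances can be recomputed directly from the population quantities in Table 6.2. The following is only a sketch: the variable names are illustrative, the fpc is ignored as in the example, and because it works from unrounded $Q_h$, $V_h'$, and $V_h''$ the last digit may differ slightly from the printed figures.

```python
# Population quantities from Table 6.2 (Jefferson County, Iowa).
N_h = [1580, 430]          # stratum sizes
n_h = [70, 30]             # assumed stratum sample sizes
S2_yh = [312, 922]         # stratum variances of y (corn acres)
S_yxh = [494, 858]         # stratum covariances of y and x
S2_xh = [2055, 7357]       # stratum variances of x (farm acres)
R_h = [0.2350, 0.2109]     # stratum ratios
S2_y, S_yx, S2_x, R = 620, 1453, 7619, 0.2242   # whole population
N, n = sum(N_h), sum(n_h)


def ratio_mean_square(s2_y, s_yx, s2_x, r):
    """S_y^2 + r^2 S_x^2 - 2 r S_yx, the variance of y - r*x."""
    return s2_y + r * r * s2_x - 2 * r * s_yx


# Q_h = W_h^2 / n_h with W_h = N_h / N (fpc ignored throughout).
Q_h = [(Nh / N) ** 2 / nh for Nh, nh in zip(N_h, n_h)]
strata = list(zip(Q_h, S2_yh, S_yxh, S2_xh, R_h))

variances = {
    "1 srs, mean per farm": S2_y / n,
    "2 srs, ratio": ratio_mean_square(S2_y, S_yx, S2_x, R) / n,
    "3 stratified, mean per farm": sum(q * s2y for q, s2y, _, _, _ in strata),
    "4 stratified, separate ratio": sum(
        q * ratio_mean_square(s2y, syx, s2x, rh) for q, s2y, syx, s2x, rh in strata
    ),
    "5 stratified, combined ratio": sum(
        q * ratio_mean_square(s2y, syx, s2x, R) for q, s2y, syx, s2x, _ in strata
    ),
}
base = variances["1 srs, mean per farm"]
relative_precision = {k: 100 * base / v for k, v in variances.items()}

for name, v in variances.items():
    print(f"{name:30s} V = {v:5.2f}   relative precision = {relative_precision[name]:5.1f}")
```

The output agrees with the variances quoted in the example to within rounding of the tabulated inputs.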


The results bring out an interesting point of wide application. Stratification
by size of farm accomplishes the same general purpose as a ratio
estimate in which the denominator is farm size. Both devices diminish the
effect of variations in farm size on the sampling error of the estimated mean
corn acres per farm. For instance, the gain in precision from a ratio
estimate is 77% when simple random sampling is used, but it is only 36%
(203 against 149) when stratified sampling is used.

In the design of surveys there may be a choice between introducing a
factor into the stratification or utilizing it in the method of estimation.
The best decision depends on the circumstances. Relevant points are:
(a) some factors, for example, geographical location, are more easily
introduced into the stratification than into the method of estimation;
(b) the issue depends on the nature of the relation between $y_i$ and $x_i$. All
simple methods of estimation work most effectively with a linear relation.
With a complex or discontinuous relation, stratification may be more
effective, since, if there are enough strata, stratification will eliminate the
effects of almost any kind of relation between $y_i$ and $x_i$. (c) If some important
variates are roughly proportional to $x_i$, but others are roughly proportional
to another variate $z_i$, it is better to use $x_i$ and $z_i$ as denominators
in ratio estimates than to stratify by one of them.


6.13 SHORT-CUT COMPUTATION OF THE VARIANCE


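If $n_h = 2$ in all strata, Keyfitz (1957) has given short-cut methods for
computing the relative variance of $\hat{Y}_{Rc}$ or $\hat{R}_c$. From (6.25), substituting
the sample estimates,

$$\frac{v(\hat{Y}_{Rc})}{\hat{Y}_{Rc}^2} = \frac{1}{2}\sum_h N_h^2(1 - f_h)\left(\frac{s_{yh}^2}{\hat{Y}_{Rc}^2} + \frac{s_{xh}^2}{X^2} - \frac{2 r_h s_{yh} s_{xh}}{\hat{Y}_{Rc} X}\right)$$

Let

$$y_{h1}' = N_h y_{h1}, \qquad y_{h2}' = N_h y_{h2}, \qquad d_{yh} = y_{h1}' - y_{h2}'$$

with similar definitions for $x$. Then, with $n_h = 2$, it is easily shown as an
algebraic identity that

$$\frac{N_h^2 s_{yh}^2}{2\hat{Y}_{Rc}^2} = \frac{(y_{h1}' - y_{h2}')^2}{4\hat{Y}_{Rc}^2} = \frac{d_{yh}^2}{4\hat{Y}_{Rc}^2}$$

with corresponding expressions for the terms in $s_{xh}^2$ and $r_h s_{yh} s_{xh}$. Hence
the relative variance may be computed as

$$\frac{v(\hat{Y}_{Rc})}{\hat{Y}_{Rc}^2} = \frac{1}{4}\sum_h \left(\frac{d_{yh}}{\hat{Y}_{Rc}} - \frac{d_{xh}}{X}\right)^2$$

In this form the fpc terms have been omitted. If $f_h$ is approximately constant,
the multiplier $1 - \bar{f}$ may be applied, where $\bar{f} = \sum f_h / L$.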
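
With two units per stratum, each stratum mean square is half the squared difference of the two observations, and this is what makes a difference-based short-cut possible. The sketch below checks that identity numerically on made-up data. The strata, the total `X`, and the function names are all hypothetical, the fpc is ignored, and the short-cut form is derived here from the identity rather than quoted from Keyfitz.

```python
# Hypothetical strata: (N_h, (y_h1, y_h2), (x_h1, x_h2)); n_h = 2 throughout.
strata = [
    (5, (4.0, 6.0), (3.0, 5.0)),
    (8, (10.0, 7.0), (9.0, 6.5)),
    (6, (2.0, 3.5), (2.5, 3.0)),
]
X = 120.0  # assumed known population total of x


def combined_ratio_total(strata, X):
    """Combined ratio estimate of the total of y."""
    y_st = sum(N * (y1 + y2) / 2 for N, (y1, y2), _ in strata)
    x_st = sum(N * (x1 + x2) / 2 for N, _, (x1, x2) in strata)
    return y_st / x_st * X


def rel_var_long(strata, X):
    """Relative variance from stratum mean squares and covariances."""
    Y_hat = combined_ratio_total(strata, X)
    total = 0.0
    for N, (y1, y2), (x1, x2) in strata:
        s2_y = (y1 - y2) ** 2 / 2            # sample mean square when n_h = 2
        s2_x = (x1 - x2) ** 2 / 2
        s_yx = (y1 - y2) * (x1 - x2) / 2     # sample covariance when n_h = 2
        total += N ** 2 / 2 * (
            s2_y / Y_hat ** 2 + s2_x / X ** 2 - 2 * s_yx / (Y_hat * X)
        )
    return total


def rel_var_short(strata, X):
    """Same quantity from the inflated differences d_yh and d_xh."""
    Y_hat = combined_ratio_total(strata, X)
    return sum(
        (N * (y1 - y2) / Y_hat - N * (x1 - x2) / X) ** 2
        for N, (y1, y2), (x1, x2) in strata
    ) / 4


print(rel_var_long(strata, X), rel_var_short(strata, X))
```

The two functions return the same value up to floating-point error, while the short-cut needs only one difference per variate per stratum.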


6.14 OPTIMUM ALLOCATION WITH A RATIO ESTIMATE 


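The optimum allocation of the $n_h$ may be different with a ratio estimate
than with a mean per unit. Consider first the variate $\hat{Y}_{Rs}$. From theorem
6.5 its variance is

$$V(\hat{Y}_{Rs}) = \sum_h \frac{N_h^2(1 - f_h)}{n_h}(S_{yh}^2 + R_h^2 S_{xh}^2 - 2R_h \rho_h S_{yh} S_{xh})$$

$$= \sum_h \frac{N_h(N_h - n_h)}{n_h} S_{dh}^2, \quad \text{with } S_{dh}^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h} d_{hi}^2 \qquad (6.27)$$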
From theorem 2.4, an unbiased estimate of $S_e^2$ is

$$\frac{1}{n - 1}\sum_{i=1}^n (e_i - \bar{e})^2$$

Now, from equation (7.10),

$$e_i - \bar{e} = (y_i - \bar{y}) - B(x_i - \bar{x}) = [(y_i - \bar{y}) - b(x_i - \bar{x})] + (b - B)(x_i - \bar{x})$$

The second term on the right, of order $1/\sqrt{n}$, may be neglected in relation
to the first term, which is of order unity. Hence in large samples we
may use

$$\frac{1}{n - 1}\sum_{i=1}^n [(y_i - \bar{y}) - b(x_i - \bar{x})]^2$$

The divisor $(n - 2)$ instead of $(n - 1)$ is suggested in
(7.18) and (7.19) because it is used in standard regression theory and is
known to give an unbiased estimate of $S_e^2$ if the population is infinite and
the regression is linear.


7.4 ACCURACY OF THE LARGE-SAMPLE FORMULA
FOR $V(\bar{y}_{lr})$


; Yu Y=6+b— Bg. gj 
Substituting for b from (7.12), 


I, — Vi Bae Gel = 3X — z) 
4 - 
> (z; — zy 
Tession estimate will be used to find 
bias ofj,. For the variance, write 
ny x4 "CERO S 2] 
T - d E E RAT A 


n È (z; — zy 
Hence, in arrays in which the x, are fixed, 


(7.20) 
This expression for th 


€ error of the Teg: 
the leading terms in t 


he variance and 


— 
where it is assumed that $S_e^2$ is the same for all $x$. The average value of this
quantity over different selections of the $x_i$ depends on the shape of the
frequency distribution of $x$. If the $x$'s are normally distributed, the average
is

$$\frac{S_e^2}{n}\left(1 + \frac{1}{n - 3}\right)$$


For a general distribution of x, the average may be shown to be (Cochran, 
1942), to terms of order 1/n3, 4 


d S3 1.3) 2y,? 
VG) = Sz (i that LAS (721) 


where y,? = k,2/S,3 is the measure of relative skewness of the distribution 
of x. 

Reverting to (7.20), the bias in $\bar{y}_{lr}$ arises from the second term on the
right, since the average value of $\bar{e}$ is zero in simple random sampling. To
obtain the leading term, we may replace $\sum (x_i - \bar{x})^2$ by its leading term,
$nS_x^2$. Also write

n 


5 e(z; — Z) = > e(z, — X) + n&(X — 2) 


Hence the leading term in the bias is the average of 


D (x; ue = 2) + = 20) (7.22) 
nS, z 


Let $u_i$ be the variate $e_i(x_i - \bar{X})$. From (7.11), its population mean
$\bar{U} = 0$. The average of the first term in (7.22) may therefore be written
Ti Y 2 
Ea- 0@— X). Eu- DeD Ee XY 723 
sg eg nS2 nS, 
z 

by theorem 2.3 (p. 24), letting $N \to \infty$. The average of the second term
in (7.22) is easily shown to be of order $1/n^2$. Thus the leading term in the
bias comes from the population covariance between $e$ and $(x - \bar{X})^2$; it
represents a contribution from the quadratic regression of $y$ on $x$ and
vanishes if the relation between $y$ and $x$ is linear.

If $\rho_1$ denotes the correlation between $e$ and $(x - \bar{X})^2$, the leading term in
the bias may be expressed, alternatively, as


E PEE? (123y 


n 


since the variance of $(x - \bar{X})^2$ is known to be $S_x^4(2 + \gamma_2)$, where $\gamma_2 =
\kappa_4/S_x^4$ is Fisher's measure of relative kurtosis.
where $d_{hi} = y_{hi} - R_h x_{hi}$ is the deviation of $y_{hi}$ from $R_h x_{hi}$. By the methods
given in Chapter 5 for finding optimum allocation, it follows that (6.27) is
minimized subject to a total cost of the form $\sum c_h n_h$, when

$$n_h \propto \frac{N_h S_{dh}}{\sqrt{c_h}}$$

With a mean per unit it will be recalled that for minimum variance $n_h$ is
chosen proportional to $N_h S_{yh}/\sqrt{c_h}$.

In the planning of a sample, the allocation with a ratio estimate may
appear a little perplexing, because it seems difficult to speculate about the
likely values of $S_{dh}$. Two rules are helpful. With a population in which
the ratio estimate is a best linear unbiased estimate, $S_{dh}$ will be roughly
proportional to $\sqrt{\bar{X}_h}$ (by theorem 6.4). In this case the $n_h$ should be
proportional to $N_h\sqrt{\bar{X}_h}/\sqrt{c_h}$. Sometimes the variance of $d_{hi}$ may be more nearly
proportional to $\bar{X}_h^2$. This leads to the allocation of $n_h$ proportional to
$N_h \bar{X}_h/\sqrt{c_h}$, that is, to the stratum total of $x_{hi}$, divided by the square root of
the cost per unit. An example of this type is discussed by Hansen, Hurwitz,
and Gurney (1946) for a sample designed to estimate sales of retail stores.

If the estimate $\hat{Y}_{Rc}$ is to be used, the same general argument applies.


Example. The different methods of allocation can be compared from data
collected in a complete enumeration of 257 commercial peach orchards in
North Carolina in June 1946 (Finkner, 1950). The purpose was to determine the
most efficient sampling procedure for estimating commercial peach production
in this area. Information was obtained on the number of peach trees and the
estimated total peach production in each orchard. The high correlation between
these two variables suggested the use of a ratio estimate. One very large orchard
was omitted.

For this illustration, the area is divided geographically into three strata. The
number of peach trees in an orchard is denoted by $x_{hi}$, and the estimated
production in bushels of peaches by $y_{hi}$. Only the first ratio estimate $\hat{Y}_{Rs}$ (based on
a separate ratio in each stratum) will be considered, since the principle is the
same for both types of stratified ratio estimate.

Four methods of allocation are compared: (a) $n_h$ proportional to $N_h$, (b) $n_h$
proportional to $N_h S_{yh}$, (c) $n_h$ proportional to $N_h\sqrt{\bar{X}_h}$, and (d) $n_h$ proportional
to $N_h \bar{X}_h = X_h$. The sample size is 100. The data for these comparisons are
summarized in Table 6.3.

The upper part of the table shows the basic data. The method employed to
calculate the four variances was first to find the $n_h$ for each type of allocation.
These values are shown in the columns headed (a) through (d) in the lower part
of the table. Thus, with allocation (a), $n_h = nN_h/N$, so that in the first stratum

$$n_1 = \frac{(100)(47)}{256} = 18$$
TABLE 6.3
DATA FROM THE NORTH CAROLINA PEACH SURVEY

Strata   $S_{xh}^2$   $S_{yxh}$   $S_{yh}^2$   $S_{xh}$   $S_{yh}$   $\bar{X}_h$   $\bar{Y}_h$   $R_h$     $S_{dh}^2$
1        5186         6462        8699         72.01      93.27      53.80         69.48         1.29133    658
2        2367         3100        4614         48.65      67.93      31.07         43.64         1.40475    573
3        4877         4817        7311         69.83      85.51      56.97         66.39         1.16547   2706
Pop.     3898         4434        6409         62.43      80.06      44.45         56.47         1.27053   1433

Strata   $N_h$   (a)   $N_h S_{yh}$   (b)   $\sqrt{\bar{X}_h}$   $N_h\sqrt{\bar{X}_h}$   (c)   $N_h \bar{X}_h$   (d)
1         47      18    4384           22     7.33                 344.5                   20    2529              22
2        118      46    8016           40     5.57                 657.3                   39    3666              32
3         91      36    7781           38     7.55                 687.1                   41    5184              46
Pop.     256     100   20181          100    20.45                1688.9                  100   11379             100
When the $n_h$ have been obtained, the corresponding $V(\hat{Y}_{Rs})$ is found by
substituting in the formula

$$V(\hat{Y}_{Rs}) = \sum_h \frac{N_h(N_h - n_h)}{n_h} S_{dh}^2$$

where

$$S_{dh}^2 = S_{yh}^2 + R_h^2 S_{xh}^2 - 2R_h S_{yxh}$$

The quantities $S_{dh}^2$ are the same for all four allocations and are given on the
extreme right of the top half of Table 6.3.

The variances and relative precisions are shown in Table 6.4.
There is not much to choose among the different allocations, as would be
expected, since the $n_h$ do not differ greatly in the four methods. Method 4, in
which allocation is proportional to the total number of peach trees in the stratum,
appears a trifle superior to the others.


TABLE 6.4
COMPARISON OF FOUR METHODS OF ALLOCATION

Method of allocation:              Variance within strata
$n_h$ proportional to              1         2          3          Total      Relative precision
1. $N_h$                           49,824    105,833    376,215    531,872    100
2. $N_h S_{yh}$                    35,144    131,847    343,446    510,437    104
3. $N_h\sqrt{\bar{X}_h}$           41,750    136,964    300,312    479,026    111
4. $N_h \bar{X}_h$                 35,144    181,710    240,888    457,742    116
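
The allocations and variances of this example can be recomputed from the stratum data of Table 6.3, as in the sketch below. The source does not say how fractional allocations were rounded to integers; largest-remainder rounding is assumed here because it happens to reproduce the tabulated $n_h$. The function names are illustrative, and the variance components differ from the printed ones in the last digits because the rounded $S_{dh}^2$ are used.

```python
import math

# Stratum data from Table 6.3 (North Carolina peach survey).
N_h = [47, 118, 91]
S_yh = [93.27, 67.93, 85.51]
Xbar_h = [53.80, 31.07, 56.97]
S2_dh = [658, 573, 2706]
n = 100


def allocate(weights, n):
    """Integer allocation proportional to weights (largest-remainder rounding, an assumption)."""
    total = sum(weights)
    shares = [n * w / total for w in weights]
    alloc = [math.floor(s) for s in shares]
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - alloc[i], reverse=True
    )
    for i in by_remainder[: n - sum(alloc)]:
        alloc[i] += 1
    return alloc


def variance_separate_ratio(n_h):
    """Sum over strata of N_h (N_h - n_h) S_dh^2 / n_h."""
    return sum(N * (N - m) * s2 / m for N, m, s2 in zip(N_h, n_h, S2_dh))


schemes = {
    "N_h": [float(N) for N in N_h],
    "N_h * S_yh": [N * s for N, s in zip(N_h, S_yh)],
    "N_h * sqrt(Xbar_h)": [N * math.sqrt(x) for N, x in zip(N_h, Xbar_h)],
    "N_h * Xbar_h": [N * x for N, x in zip(N_h, Xbar_h)],
}
allocations = {name: allocate(w, n) for name, w in schemes.items()}
variances = {name: variance_separate_ratio(a) for name, a in allocations.items()}
base = variances["N_h"]

for name in schemes:
    print(name, allocations[name], round(variances[name]),
          round(100 * base / variances[name]))
```

The relative precisions come out as 100, 104, 111, and 116, in line with Table 6.4.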
6.15 UNBIASED RATIO-TYPE ESTIMATES


In recent years there has been considerable interest in developing estimates
of the ratio type that are unbiased or subject to a smaller bias than
the ordinary ratio estimate. Such estimates might be useful in surveys
with many strata and small samples in each stratum if the separate ratio
estimate seems appropriate.

One estimate, due to Hartley and Ross (1954), can be derived by starting
with the mean $\bar{r}$ of the ratios $r_i = y_i/x_i$ and correcting it for bias.
Now

$$\frac{1}{N}\sum_{i=1}^N r_i(x_i - \bar{X}) = \frac{1}{N}\sum_{i=1}^N y_i - \left(\frac{1}{N}\sum_{i=1}^N r_i\right)\bar{X} = \bar{Y} - \bar{X}E(r_i) = \bar{X}[R - E(r_i)]$$

But in simple random sampling $E(\bar{r}) = E(r_i)$. Hence

$$\text{bias in } \bar{r} = E(\bar{r}) - R = -\frac{1}{\bar{X}N}\sum_{i=1}^N r_i(x_i - \bar{X}) \qquad (6.28)$$

By theorem 2.3, an unbiased sample estimate of

$$\frac{1}{N - 1}\sum_{i=1}^N r_i(x_i - \bar{X})$$

is

$$\frac{1}{n - 1}\sum_{i=1}^n r_i(x_i - \bar{x}) = \frac{n}{n - 1}(\bar{y} - \bar{r}\bar{x})$$

On substituting into (6.28), the estimate $\bar{r}$, corrected for bias, becomes

$$\bar{r}' = \bar{r} + \frac{n(N - 1)}{(n - 1)N\bar{X}}(\bar{y} - \bar{r}\bar{x}) \qquad (6.29)$$

The corresponding unbiased estimate of the population total $Y$ is

$$\bar{r}'X = \bar{r}X + \frac{n(N - 1)}{n - 1}(\bar{y} - \bar{r}\bar{x}) \qquad (6.30)$$
CT 


Example. Compute the estimated stratum total for the following simple
random sample of size 8 from a stratum with $N = 16$, $X = 106$. From Table 6.5

$$\bar{r}'X = (2.389)(106) + \frac{(8)(15)}{7}[11.000 - (2.389)(5.5)] = 253.2 - 36.7 = 216.5$$

TABLE 6.5
COMPUTATION OF THE HARTLEY-ROSS ESTIMATE

Unit     1       2       3       4       5       6       7       8       Mean
$y_i$    8       15      5       7       5       13      11      24      11.000
$x_i$    8       6       1       4       3       5       4       13      5.500
$r_i$    1.000   2.500   5.000   1.750   1.667   2.600   2.750   1.846   2.389
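
The arithmetic of this example can be reproduced with exact fractions, and the claim of unbiasedness can be checked by brute force on a tiny population. The sketch below is illustrative only: the function name is hypothetical, and the exhaustive check uses stratum I of the small artificial population of Table 6.6 ($N = 4$, $n = 2$).

```python
from fractions import Fraction
from itertools import combinations


def hartley_ross_total(y, x, N, X):
    """Hartley-Ross estimate of the population total of y, following (6.30)."""
    n = len(y)
    r_bar = sum(Fraction(yi) / Fraction(xi) for yi, xi in zip(y, x)) / n
    y_bar = Fraction(sum(y), n)
    x_bar = Fraction(sum(x), n)
    return r_bar * X + Fraction(n * (N - 1), n - 1) * (y_bar - r_bar * x_bar)


# Sample of Table 6.5: n = 8 from a stratum with N = 16, X = 106.
y = [8, 15, 5, 7, 5, 13, 11, 24]
x = [8, 6, 1, 4, 3, 5, 4, 13]
estimate = hartley_ross_total(y, x, N=16, X=106)
print(float(estimate))

# Exhaustive check of unbiasedness: average the estimate over all 6 samples.
pop_y, pop_x = [2, 3, 4, 11], [2, 4, 6, 20]
samples = list(combinations(range(4), 2))
mean_estimate = sum(
    hartley_ross_total([pop_y[i] for i in s], [pop_x[i] for i in s],
                       N=4, X=sum(pop_x))
    for s in samples
) / len(samples)
print(mean_estimate, sum(pop_y))
```

The first value agrees with the 216.5 of the example up to rounding of $\bar{r}$, and the average over all samples equals the true stratum total exactly.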


An exact formula for $V(\bar{r}'X)$ exists for any size of sample. If the fpc is
negligible, the formula is

$$V(\bar{r}'X) = \frac{N^2}{n}\left[S_y^2 + \bar{r}_N^2 S_x^2 - 2\bar{r}_N S_{yx} + \frac{1}{n - 1}(S_r^2 S_x^2 + S_{rx}^2)\right] \qquad (6.31)$$

where $\bar{r}_N$ and $S_r^2$ are the population mean and variance of the $r_i$ and $S_{rx}$ is
the population covariance of $r_i$ and $x_i$. This formula was given by Goodman
and Hartley (1958), who also give an unbiased sample estimate of
$V(\bar{r}'X)$. For the formula for the true variance when the fpc is not negligible,
see Robson (1957).

General comparisons of the precision of $\bar{r}'X$ and the ordinary ratio
estimate $\hat{Y}_R$ can be made only in samples large enough so that the approximate
formula for $V(\hat{Y}_R)$ is valid. With $n$ large, the second term in (6.31) can
be omitted. The first term can be rewritten as

$$V(\bar{r}'X) \doteq \frac{N^2}{n}\,\frac{\sum_{i=1}^N [(y_i - \bar{Y}) - \bar{r}_N(x_i - \bar{X})]^2}{N - 1}$$

For $\hat{Y}_R$ the corresponding expression is

$$V(\hat{Y}_R) \doteq \frac{N^2}{n}\,\frac{\sum_{i=1}^N (y_i - Rx_i)^2}{N - 1}$$

Thus, as Goodman and Hartley point out, $\bar{r}'X$ is more precise in large
samples if the line $\bar{Y} + \bar{r}_N(x_i - \bar{X})$ fits the values $y_i$ more closely than the
line $Rx_i$. Although extensive comparisons have not been made, it seems
likely that in most applications in which ratio estimates are appropriate
$V(\hat{Y}_R)$ will be smaller in large samples. Further comparisons of the estimates
in small samples would be valuable, since these are the cases in which
freedom from bias may be important.

As an alternative approach, Lahiri (1951) showed that the ordinary ratio
estimate is unbiased if the sample is drawn with probability proportional
to $\sum x_i$. There are two ways of drawing the sample. One, due to Lahiri, is
to draw a sample without replacement in the usual way. If $T$ is the sum
of the $n$ largest values of $x_i$ in the population, draw a random number
between 1 and $T$, say $v$. If $\sum x_i \ge v$ for the sample, retain it. Otherwise
replace it and start again, drawing a new random number for each sample
that is tried until one is found that can be retained. Clearly, the probability
that a sample will be retained is proportional to $\sum x_i$. A second
method, Midzuno (1951), is to draw the first member of the sample with
probability proportional to $x_i$ and the remaining $(n - 1)$ members with
equal probability. The following proof shows that $\hat{R}$ is unbiased.

If $\sum x_i$ is added over all simple random samples of size $n$, the total is
$\binom{N-1}{n-1}X$, since every unit appears in $\binom{N-1}{n-1}$ samples. The probability
that a specified sample will be drawn is therefore

$$P = \frac{\sum x_i}{\binom{N-1}{n-1}X}$$

For this method of sample selection, with $\hat{R}_L = \sum y_i/\sum x_i$,

$$E(\hat{R}_L) = \sum_{srs} P\,\frac{\sum y_i}{\sum x_i}$$

where $\sum_{srs}$ denotes a sum over all simple random samples. Substituting
the value of $P$,

$$E(\hat{R}_L) = \frac{\sum_{srs}\left(\sum y_i\right)}{\binom{N-1}{n-1}X} = \frac{\binom{N-1}{n-1}Y}{\binom{N-1}{n-1}X} = R$$

No exact expression for $V(\hat{R}_L)$ has been found. It is easily shown that
in large samples $\hat{R}_L$ has the same approximate variance as $\hat{R}$.
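
The proof can be checked by enumerating every possible sample of a small population. The sketch below is an illustration, not part of the original argument: it weights each sample by a probability proportional to its $\sum x_i$ and confirms with exact fractions that the expectation of $\sum y_i/\sum x_i$ equals $R$, whereas equal-probability sampling gives a biased result. The function names are hypothetical, and the data are stratum I of Table 6.6.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def expectation_pps_sum(y, x, n):
    """E(sum y / sum x) when P(sample) = sum x / (C(N-1, n-1) * X)."""
    N, X = len(x), sum(x)
    norm = comb(N - 1, n - 1) * X
    total_prob, expectation = Fraction(0), Fraction(0)
    for s in combinations(range(N), n):
        sx = sum(x[i] for i in s)
        sy = sum(y[i] for i in s)
        p = Fraction(sx, norm)          # selection probability of this sample
        total_prob += p
        expectation += p * Fraction(sy, sx)
    return total_prob, expectation


def expectation_srs(y, x, n):
    """E(sum y / sum x) under simple random sampling, for comparison."""
    samples = list(combinations(range(len(x)), n))
    return sum(
        Fraction(sum(y[i] for i in s), sum(x[i] for i in s)) for s in samples
    ) / len(samples)


y, x = [2, 3, 4, 11], [2, 4, 6, 20]
R = Fraction(sum(y), sum(x))
total_prob, e_lahiri = expectation_pps_sum(y, x, n=2)
e_srs = expectation_srs(y, x, n=2)
print(total_prob, e_lahiri, R, float(e_srs))
```

The probabilities sum to one and the expectation equals $R = 0.625$ exactly, while the simple random sampling expectation is noticeably larger for this population.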


Example. This illustrates an artificial population in which the Lahiri estimate
performed well. The population contains three strata with $N_h = 4$, $n_h = 2$ in
each stratum. The population was deliberately constructed so that (a) $R_h$ varies
markedly from stratum to stratum, thus favoring a separate ratio estimate, and
(b) the ratio estimate within each stratum is badly biased. Five methods of
estimating the population total $Y$ were compared.

Simple expansion: $N\bar{y}_{st} = \sum_h N_h \bar{y}_h$

The combined ratio estimate: $(\bar{y}_{st}/\bar{x}_{st})X$

The separate ratio estimate: $\sum_h (\bar{y}_h/\bar{x}_h)X_h$

The separate Hartley-Ross estimate: $\sum_h (\bar{r}_h' X_h)$

The separate Lahiri estimate: $\sum_h (\bar{y}_h/\bar{x}_h)X_h$

There are $6^3 = 216$ possible samples. Since estimates were made for every
sample, biases and variances are exact. For the computations I am indebted to
Joseph Sedransk.


TABLE 6.6
A SMALL ARTIFICIAL POPULATION

            Stratum I       Stratum II      Stratum III
            y      x        y      x        y      x
            2      2        2      1        3      1
            3      4        5      4        7      3
            4      6        9      8        9      4
            11     20       24     23       25     12
Totals      20     32       40     36       44     20
$R_h$       0.625           1.111           2.200

TABLE 6.7
RESULTS FOR THE DIFFERENT ESTIMATES OF Y

Method                     Variance    (Bias)²    MSE
Simple expansion           820.3       0.0        820.3
Combined ratio             262.8       6.5        269.3
Separate ratio             35.9        24.1       60.0
Separate Hartley-Ross      153.6       0.0        153.6
Separate Lahiri            19.6        0.0        19.6

The results show several interesting features. For the combined ratio 
estimate, the contribution of the (bias)? to the mean square error is trivial, 
despite the extreme conditions. The separate ratio estimate is much more 
accurate than the combined estimate because of the wide variation in R,, 
but it is badly biased. The Hartley-Ross separate estimate is superior to 
the combined ratio estimate but inferior, as judged by the mean square 
error, to the separate ratio estimate. The separate Lahiri estimate was 
easily the best of the group. The Lahiri method suits this population 
because the fourth unit in each stratum has a high probability of being 
drawn and samples containing this unit give good estimates of R,. 

No general conclusions can be drawn from this example. One practical
limitation of the Lahiri method is that the sampler would probably not
want to draw the sample with probability proportional to $\sum x_i$ unless he
intended to use this type of ratio estimate for all the major items in the
survey.
Quenouille (1956) has produced a method of adjustment, applicable
to a broad class of estimates, which reduces the bias from order $1/n$ to
order $1/n^2$. The utility of this method for ratio estimates was pointed out by
Durbin (1959). The bias of estimates like $\hat{R}$, $\hat{Y}_R$ may be expanded in a
Taylor series of the form

$$E(\hat{R}) = R + \frac{b_1}{n} + \frac{b_2}{n^2} + \cdots \qquad (6.32)$$

Let the sample be divided at random into $g$ groups, each of size $m$, where
$n = gm$. From (6.32)

$$E(g\hat{R}) = gR + \frac{b_1}{m} + \frac{b_2}{gm^2} + \cdots \qquad (6.33)$$

Now let $\hat{R}_j$ be the ordinary ratio $\sum y/\sum x$, computed from the sample after
omitting the $j$th group. Since $\hat{R}_j$ is obtained from a simple random sample
of size $m(g - 1)$, we have

$$E(\hat{R}_j) = R + \frac{b_1}{(g - 1)m} + \frac{b_2}{(g - 1)^2 m^2} + \cdots$$

Hence

$$E[(g - 1)\hat{R}_j] = (g - 1)R + \frac{b_1}{m} + \frac{b_2}{(g - 1)m^2} + \cdots$$

Subtraction from (6.33) gives, to order $n^{-2}$,

$$E[g\hat{R} - (g - 1)\hat{R}_j] = R - \frac{b_2}{g(g - 1)m^2} = R - \frac{b_2}{n^2}\,\frac{g}{g - 1}$$

Thus the bias is now of order $1/n^2$. We can construct $g$ estimates of this
type, one for each group. Quenouille has shown that if their average is
taken, that is,

$$\hat{R}_Q = g\hat{R} - (g - 1)\frac{\hat{R}_1 + \hat{R}_2 + \cdots + \hat{R}_g}{g}$$

its variance differs from that of $\hat{R}$ by terms of order $1/n^2$. Any increase in
variance due to this adjustment for bias should therefore be negligible in
moderately large samples.

The simplest estimate of this type is obtained by taking $g = 2$. The estimates
$\hat{R}_1$ and $\hat{R}_2$ are those given by the two halves of the sample, and

$$\hat{R}_Q = 2\hat{R} - \frac{\hat{R}_1 + \hat{R}_2}{2}$$

At the other extreme we can take $g = n$. The advantages of one choice of
$g$ over another have not been investigated.

This method cannot be expected to help when small samples are taken
within strata, as in the artificial example with $n_h = 2$ on p. 179. In samples
of moderate size from populations showing wide variation in $x$, it may be
worth applying as a precaution.
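
A minimal sketch of Quenouille's adjustment for a ratio follows. The function names are hypothetical, the groups are taken as consecutive blocks of the sample (which assumes the sample order is already random), and integer data are used so that the arithmetic stays exact. The data are the eight units of Table 6.5, used purely as an illustration.

```python
from fractions import Fraction


def ratio(y, x):
    """Ordinary ratio estimate sum(y) / sum(x) as an exact fraction."""
    return Fraction(sum(y), sum(x))


def quenouille_ratio(y, x, g):
    """g*R - (g - 1) * average of the g ratios computed with one group omitted."""
    n = len(y)
    if n % g != 0:
        raise ValueError("sample size must be a multiple of g")
    m = n // g
    r_full = ratio(y, x)
    leave_out = []
    for j in range(g):
        # Keep every unit outside the j-th consecutive block of size m.
        keep = [i for i in range(n) if not (j * m <= i < (j + 1) * m)]
        leave_out.append(ratio([y[i] for i in keep], [x[i] for i in keep]))
    return g * r_full - Fraction(g - 1, g) * sum(leave_out)


y = [8, 15, 5, 7, 5, 13, 11, 24]
x = [8, 6, 1, 4, 3, 5, 4, 13]
print(ratio(y, x), quenouille_ratio(y, x, g=2), quenouille_ratio(y, x, g=8))
```

With $g = 2$ the function reduces to twice the full-sample ratio minus the mean of the two half-sample ratios, and when $y$ is exactly proportional to $x$ the adjustment leaves the ratio unchanged.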


6.16 COMPARISON OF TWO RATIOS 


In analytical surveys it is frequently necessary to estimate the difference
$R - R'$ between two ratios and to compute the standard error of $\hat{R} - \hat{R}'$.
The formulas given here are for the estimated variance of $\hat{R} - \hat{R}'$, since
these are the ones most commonly required. The fpc terms are omitted for
reasons presented in section 2.12.

Simple random sampling is assumed at first. Three cases can be distinguished.


The Two Ratios Are Independent 


This occurs when the units are classified into two distinct classes and we 
wish to compare ratios estimated separately in the two classes. For in- 
stance, in a study of household expenditures a simple random sample of 
households might be subdivided into owned and rented houses in order to 
compare the proportions of income spent on upkeep of the house in the 
two classes. If the estimated ratios are denoted by $\hat{R} = \bar{y}/\bar{x}$, $\hat{R}' = \bar{y}'/\bar{x}'$,
then

$$v(\hat{R} - \hat{R}') = v(\hat{R}) + v(\hat{R}')$$


The Two Ratios Have the Same Denominator 

When the unit is a cluster of families, we might wish to compare the 
proportion of adult males who use electric shavers with the proportion who 
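use razors. In any unit, $y$ = number of adult males using electric shavers,
$y'$ = number of adult males using razors, and $x$ = total number of adult
males.

$$\hat{R} - \hat{R}' = \frac{\bar{y} - \bar{y}'}{\bar{x}}$$

If $d_i = y_i - y_i'$, the estimated variance of $\hat{R} - \hat{R}'$ may be computed as

$$v(\hat{R} - \hat{R}') = \frac{1}{n(n - 1)\bar{x}^2}\sum_{i=1}^n [d_i - (\hat{R} - \hat{R}')x_i]^2$$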
The Two Ratios Have Different Denominators but May Be Correlated


An example is the comparison of the proportion of men who smoke
with the proportion of women who smoke, in a survey in which the unit is
a cluster of houses. Mathematically, this is the most general case.

$$v(\hat{R} - \hat{R}') = v(\hat{R}) + v(\hat{R}') - 2\,\text{cov}(\hat{R}\hat{R}')$$

The only unfamiliar term is cov $(\hat{R}\hat{R}')$. Writing, in the usual way,

$$\hat{R} - R \doteq \frac{\bar{y} - R\bar{x}}{\bar{X}}, \qquad \hat{R}' - R' \doteq \frac{\bar{y}' - R'\bar{x}'}{\bar{X}'}$$

we have

$$\text{cov}(\hat{R}\hat{R}') \doteq \frac{1}{\bar{X}\bar{X}'}\,\text{cov}\,(\bar{y} - R\bar{x})(\bar{y}' - R'\bar{x}')$$

A sample estimate may be computed as follows:

$$\text{cov}(\hat{R}\hat{R}') = \frac{1}{n(n - 1)\bar{x}\bar{x}'}\sum_{i=1}^n (y_i y_i' - \hat{R}y_i' x_i - \hat{R}' y_i x_i' + \hat{R}\hat{R}' x_i x_i')$$

Example. The 1954 field trial of the Salk polio vaccine was conducted
among children in the first three grades in all schools in a number of counties.
The counties were not randomly selected, since those with a history of
polio attacks were favored, but for this illustration it will be assumed that they
are a random sample from some population.

Children whose parents did not give permission to participate in the trial were
called the "not inoculated" group and, of course, received no shots. Half of the
children who received permission were given three shots of an inert liquid and
were called the "placebo" group. From the data in Table 6.8, compare the
frequencies $R$, $R'$ of paralytic polio in the "not inoculated" and "placebo" groups.
To reduce the amount of data, the comparison is restricted to 34 counties, each
having more than 4000 children in the two groups combined.

In these data any variation in the polio attack rate from county to county
would produce a positive correlation between $\hat{R}$ and $\hat{R}'$.

The following quantities are derived from the totals.

Placebo: $\hat{R} = \dfrac{88}{167.4} = 0.525687$, $\bar{x} = \dfrac{167.4}{34} = 4.9235$

Not inoculated: $\hat{R}' = \dfrac{99}{284.6} = 0.347857$, $\bar{x}' = \dfrac{284.6}{34} = 8.3706$

For $v(\hat{R})$, $v(\hat{R}')$, and cov $(\hat{R}\hat{R}')$, all uncorrected sums of squares and products
among the four variates are required.

$$v(\hat{R}) = \frac{1}{n(n - 1)\bar{x}^2}\left(\sum y^2 - 2\hat{R}\sum yx + \hat{R}^2\sum x^2\right)$$

$$= \frac{1}{(34)(33)(4.9235)^2}[(564) - (1.05137)(822.2) + (0.27635)(1661.92)] = 0.00584$$
TABLE 6.8
NUMBER OF CHILDREN ($x$, $x'$) AND OF PARALYTIC CASES ($y$, $y'$) PER COUNTY

$x$*     $x'$     $y$†    $y'$         $x$      $x'$     $y$     $y'$
4.1      2.4      0       0            13.8     25.6     3       3
3.5      8.0      1       6            10.5     8.1      2       0
4.1      6.1      7       2            21.6     25.9     10      7
2.6      4.6      2       1            3.5      6.7      2       2
2.4      1.5      2       1            6.8      7.3      3       8
2.2      1.9      0       0            2.3      3.7      0       1
1.1      4.0      1       1            2.6      2.9      2       0
1.6      4.0      1       2            6.0      11.1     3       1
5.7      7.8      1       4            11.0     14.8     7       11
3.3      11.0     3       7            19.4     42.5     11      14
1.0      3.8      0       1            6.8      13.7     6       2
2.0      5.2      1       0            1.2      4.0      3       1
8.3      19.0     4       4            5.4      9.3      11      6
1.0      3.7      1       5            1.7      2.6      0       2
1.1      4.2      0       1            2.1      2.3      0       0
2.3      6.8      1       2            1.5      2.6      0       0
1.9      3.5      0       2            3.0      4.0      0       2

Totals   167.4    284.6   88      99

* $x$, $x'$ = numbers of "placebo" and "not inoculated" children (in 1000's)
† $y$, $y'$ = numbers of paralytic polio cases in the placebo and not inoculated
groups


Similarly, we find $v(\hat{R}') = 0.00240$.

$$\text{cov}(\hat{R}\hat{R}') = \frac{1}{n(n - 1)\bar{x}\bar{x}'}\left(\sum yy' - \hat{R}\sum y'x - \hat{R}'\sum yx' + \hat{R}\hat{R}'\sum xx'\right)$$

$$= \frac{1}{(34)(33)(4.9235)(8.3706)}[(497) - (0.52569)(844.6) - (0.34786)(1397.4) + (0.52569)(0.34786)(2690.8)] = 0.00127$$

Hence

$$\text{s.e.}(\hat{R} - \hat{R}') = \sqrt{0.00584 + 0.00240 - 0.00254} = 0.0754$$

Since $\hat{R} - \hat{R}' = 0.1778$, the difference approaches significance at the 5% level
(the distribution of $\hat{R} - \hat{R}'$ may be somewhat skew for this size of sample).
A possible explanation is that the not-inoculated children may have had more
natural protection against polio than the placebo children.
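
The whole computation can be repeated from the county figures of Table 6.8, as in the sketch below. A few table entries are hard to read in the source; the values used here are those that reproduce the printed column totals and sums of products, which is an assumption on my part. Variable and function names are illustrative.

```python
import math

# Table 6.8 rows: (x, x', y, y') = (placebo children, not-inoculated children,
# placebo cases, not-inoculated cases); children are in 1000's.
rows = [
    (4.1, 2.4, 0, 0), (3.5, 8.0, 1, 6), (4.1, 6.1, 7, 2), (2.6, 4.6, 2, 1),
    (2.4, 1.5, 2, 1), (2.2, 1.9, 0, 0), (1.1, 4.0, 1, 1), (1.6, 4.0, 1, 2),
    (5.7, 7.8, 1, 4), (3.3, 11.0, 3, 7), (1.0, 3.8, 0, 1), (2.0, 5.2, 1, 0),
    (8.3, 19.0, 4, 4), (1.0, 3.7, 1, 5), (1.1, 4.2, 0, 1), (2.3, 6.8, 1, 2),
    (1.9, 3.5, 0, 2), (13.8, 25.6, 3, 3), (10.5, 8.1, 2, 0), (21.6, 25.9, 10, 7),
    (3.5, 6.7, 2, 2), (6.8, 7.3, 3, 8), (2.3, 3.7, 0, 1), (2.6, 2.9, 2, 0),
    (6.0, 11.1, 3, 1), (11.0, 14.8, 7, 11), (19.4, 42.5, 11, 14),
    (6.8, 13.7, 6, 2), (1.2, 4.0, 3, 1), (5.4, 9.3, 11, 6), (1.7, 2.6, 0, 2),
    (2.1, 2.3, 0, 0), (1.5, 2.6, 0, 0), (3.0, 4.0, 0, 2),
]
n = len(rows)
x, xp, y, yp = (list(col) for col in zip(*rows))


def dot(a, b):
    """Uncorrected sum of products."""
    return sum(p * q for p, q in zip(a, b))


x_bar, xp_bar = sum(x) / n, sum(xp) / n
R, Rp = sum(y) / sum(x), sum(yp) / sum(xp)


def var_ratio(y, x, r, x_bar):
    """Estimated variance of a ratio (fpc ignored)."""
    return (dot(y, y) - 2 * r * dot(y, x) + r * r * dot(x, x)) / (n * (n - 1) * x_bar ** 2)


v_R = var_ratio(y, x, R, x_bar)
v_Rp = var_ratio(yp, xp, Rp, xp_bar)
cov = (dot(y, yp) - R * dot(yp, x) - Rp * dot(y, xp) + R * Rp * dot(x, xp)) / (
    n * (n - 1) * x_bar * xp_bar
)
se = math.sqrt(v_R + v_Rp - 2 * cov)
print(round(R - Rp, 4), round(v_R, 5), round(v_Rp, 5), round(cov, 5), round(se, 4))
```

The printed values agree with those of the worked example to the number of figures shown there.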

The same problem may arise in stratified samples in which the domains
of study cut across strata. If $\hat{R}_h$, $\hat{R}_h'$ appear to vary from stratum to stratum,
the comparison will probably be based on an examination of the values of
Figure 12.1 plots the values of the ratio $c_n/c_{n'}$ (on a log scale) against $\rho$.
Curve I is the relationship when double and single sampling are equally
precise; curve II holds when $V_{opt} = 0.8V(\bar{y})$, that is, when double sampling
gives a 25% increase in precision; and curve III refers to a 50% increase
in precision. For example, when $\rho = 0.8$, double sampling equals single
sampling in precision if $c_n/c_{n'}$ is 4, gives a 25% increase in precision if
$c_n/c_{n'}$ is about 7½, and a 50% increase if $c_n/c_{n'}$ is about 13.

For practical use, the curves overestimate the gains to be achieved from
double sampling, because the best values of $n$ and $n'$ must either be estimated
from previous data or be guessed. Some allowance for errors in
these estimations should be made before deciding to adopt double sampling.

For any $\rho$, there is an upper limit to the gain in precision from double
sampling. This occurs when information on $\bar{x}'$ is obtained free ($c_{n'} = 0$).
The upper limit to the relative precision is $1/(1 - \rho^2)$.


Fig. 12.1 Relation between $c_n/c_{n'}$ and $\rho$ for three fixed values of the relative precision
of double and single sampling. Vertical axis: ratio of cost per unit in second sample to
cost per unit in first sample. Horizontal axis: $\rho$ = correlation between $y_i$ and $x_i$.

Curve I: double and single sampling equally precise.
Curve II: double sampling gives 25 per cent increase in precision.
Curve III: double sampling gives 50 per cent increase in precision.
12.7 ESTIMATED VARIANCE IN DOUBLE SAMPLING
FOR REGRESSION


If terms in $1/n^2$ are negligible, $V(\bar{y}_{lr})$ is given by (12.24):

$$V(\bar{y}_{lr}) = \frac{S_y^2(1 - \rho^2)}{n} + \frac{\rho^2 S_y^2}{n'}$$

With a linear regression model, the quantity

$$s_{y.x}^2 = \frac{1}{n - 2}\left[\sum (y_i - \bar{y})^2 - b^2\sum (x_i - \bar{x})^2\right]$$

is an unbiased estimate of $S_y^2(1 - \rho^2)$, where the subscript « has now
been dropped. Since

$$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$$

is an unbiased estimate of $S_y^2$, it follows that

$$s_y^2 - s_{y.x}^2$$

is an unbiased estimate of $\rho^2 S_y^2$.
Thus a sample estimate of $V(\bar{y}_{lr})$ is

$$v(\bar{y}_{lr}) = \frac{s_{y.x}^2}{n} + \frac{s_y^2 - s_{y.x}^2}{n'} \qquad (12.29)$$

If the second sample is very small and terms in $1/n^2$ are not negligible,
a suggested estimate of variance is

$$v(\bar{y}_{lr}) = s_{y.x}^2\left[\frac{1}{n} + \frac{(\bar{x}' - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right] + \frac{s_y^2 - s_{y.x}^2}{n'}$$

This is a hybrid of the conditional variance and the average variance. 


12.8 RATIO ESTIMATES 


If the first sample is used to obtain $\bar{x}'$ for a ratio estimate of $\bar{Y}$, the
estimate is

$$\bar{y}_R = \frac{\bar{y}}{\bar{x}}\,\bar{x}' \qquad (12.30)$$
$\hat{R}_h - \hat{R}_h'$ in individual strata. By finding the standard errors of these
differences, it is possible to determine whether these differences vary from stratum
to stratum and, if not, to compute an efficient over-all difference.

If the $R_h$, $R_h'$ exhibit no real variation from stratum to stratum, it may
be sufficient to compare the combined estimates $\hat{R}_c$ and $\hat{R}_c'$. As before,

$$v(\hat{R}_c - \hat{R}_c') = v(\hat{R}_c) + v(\hat{R}_c') - 2\,\text{cov}(\hat{R}_c\hat{R}_c')$$

where, putting $d_{hi} = (y_{hi} - \bar{y}_h) - \hat{R}_c(x_{hi} - \bar{x}_h)$,

$$v(\hat{R}_c) = \frac{1}{\hat{X}_{st}^2}\sum_h \frac{N_h^2}{n_h(n_h - 1)}\sum_i d_{hi}^2$$

$$\text{cov}(\hat{R}_c\hat{R}_c') = \frac{1}{\hat{X}_{st}\hat{X}_{st}'}\sum_h \frac{N_h^2}{n_h(n_h - 1)}\sum_i d_{hi}d_{hi}'$$

A more thorough discussion of the comparison of ratios, including short-cut
computing formulas when the sample permits them, has been given by
Kish and Hess (1959).


6.17 MULTIVARIATE RATIO ESTIMATES 

Olkin (1958) has extended the ratio estimate to the situation in which $p$
auxiliary $x$-variables $(x_1, x_2, \ldots, x_p)$ are available. For the population
total, the proposed estimate, say $\hat{Y}_{MR}$ for multivariate ratio, is

$$\hat{Y}_{MR} = W_1\frac{\bar{y}}{\bar{x}_1}X_1 + W_2\frac{\bar{y}}{\bar{x}_2}X_2 + \cdots + W_p\frac{\bar{y}}{\bar{x}_p}X_p$$

$$= W_1\hat{Y}_{R1} + W_2\hat{Y}_{R2} + \cdots + W_p\hat{Y}_{Rp}$$

where the $W_i$ are weights to be determined to maximize the precision of
$\hat{Y}_{MR}$, subject to $\sum W_i = 1$. This type of estimate appears appropriate when
the regression of $y$ on $x_1, x_2, \ldots, x_p$ is linear and passes through the origin.
The population totals $X_i$ must be known.

The method is described for two $x$-variates, since this should be the most
frequent application. We have

$$\hat{Y}_{MR} - Y = W_1(\hat{Y}_{R1} - Y) + W_2(\hat{Y}_{R2} - Y)$$

Hence

$$V(\hat{Y}_{MR}) = W_1^2 V(\hat{Y}_{R1}) + 2W_1W_2\,\text{cov}(\hat{Y}_{R1}\hat{Y}_{R2}) + W_2^2 V(\hat{Y}_{R2})$$

$$= W_1^2 V_{11} + 2W_1W_2V_{12} + W_2^2 V_{22}$$

where $V_{11} = V(\hat{Y}_{R1})$, etc. The values of $W_1$, $W_2$ which minimize the
variance, subject to $W_1 + W_2 = 1$, are found to be

$$W_1 = \frac{V_{22} - V_{12}}{V_{11} + V_{22} - 2V_{12}}, \qquad W_2 = \frac{V_{11} - V_{12}}{V_{11} + V_{22} - 2V_{12}}$$

and the minimum variance is

$$V_{min}(\hat{Y}_{MR}) = \frac{V_{11}V_{22} - V_{12}^2}{V_{11} + V_{22} - 2V_{12}}$$

With $p$ variates, it is necessary to compute the inverse $V^{-1}$ of the matrix
$V_{ij}$. Then the optimum $W_i = \Sigma_i/\Sigma$, where $\Sigma_i$ is the sum of the elements
in the $i$th column of $V^{-1}$ and $\Sigma$ is the sum of all the $p^2$ elements of $V^{-1}$. The
minimum variance is $1/\Sigma$.

In practice, the weights are determined from estimated variances and
covariances $v_{ij}$. From (6.7) in section 6.3,

$$v_{11} = \frac{(1 - f)}{n}\hat{Y}^2(c_{yy} + c_{11} - 2c_{y1})$$

$$v_{22} = \frac{(1 - f)}{n}\hat{Y}^2(c_{yy} + c_{22} - 2c_{y2})$$

where $c_{yy} = s_y^2/\bar{y}^2$, etc. The covariance can be expressed as

$$v_{12} = \frac{(1 - f)}{n}\hat{Y}^2(c_{yy} + c_{12} - c_{y1} - c_{y2})$$

A convenient method of computation is first to obtain the matrix

$$C = \begin{pmatrix} c_{yy} & c_{y1} & c_{y2} \\ c_{y1} & c_{11} & c_{12} \\ c_{y2} & c_{12} & c_{22} \end{pmatrix}$$

If $v_{ij}' = nv_{ij}/(1 - f)\hat{Y}^2$, the matrix $v_{ij}'$ is easily obtained by taking diagonal
contrasts in $C$, that is,

$$v_{11}' = c_{yy} + c_{11} - c_{y1} - c_{y1}$$

$$v_{12}' = c_{yy} + c_{12} - c_{y1} - c_{y2}, \text{ etc.}$$

The factor $(1 - f)\hat{Y}^2/n$ is not needed when computing the $w_i$, but it must be
inserted when computing the minimum variance. Thus

$$v_{min}(\hat{Y}_{MR}) = \frac{(1 - f)\hat{Y}^2}{n}\,\frac{(v_{11}'v_{22}' - v_{12}'^2)}{(v_{11}' + v_{22}' - 2v_{12}')}$$

In view of the amount of computation involved, this estimate will probably
be restricted to smaller surveys of a specialized type. It is
capable of giving a marked increase in precision over $\hat{Y}_{R1}$ or $\hat{Y}_{R2}$.

EXERCISES 


6.1 A pilot survey of 21 households gave the following data for numbers of
members ($x$), children ($y_1$), cars ($y_2$), and TV sets ($y_3$).


E 
E 
= 
E 
a 
iy 
= 
Ex 


tà C) Ov d» 4A IO tA 


——-onwvw-o 
BANWAREAD 


Assuming that the total population $X$ is known, would you recommend that
ratio estimates be used instead of simple expansions for estimating total numbers
of children, cars, and TV sets?

6.2 In a field of barley the grain, $y_i$, and the grain plus straw, $x_i$, were weighed
for each of a large number of sampling units located at random over the field.
The total produce (grain plus straw) of the whole field was also weighed. The
following data were obtained: $c_{yy} = 1.13$, $c_{yx} = 0.78$, $c_{xx} = 1.11$. Compute
the gain in precision obtained by


63 For the data in Table 6.1, Pr = 28,367 and cg; = BID CENSUM EST 
0.0146541, c7; = 0.0156830, Compute the 95% quadratic confidence limits fo 
Y and compare them with the limits found by the normal approximation. 

6.4 The values of $y$, $x$ in a population with $N = 6$ are as follows:


HONO 6774 16 T8 G3 


ea 1 2 2 3 3 3 
Check that the regression of $y$ on $x$ is a straight line through the origin. By
computing $\hat{R}$ for all 15 simple random samples with $n = 2$ and all 20 simple
random samples with $n = 3$, verify theorem 6.2 that $\hat{R}$ is unbiased in both cases.

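6.5 The values of $y$ and $x$ are measured for each unit in a simple random
sample from a population. If $\bar{X}$ is known, which of the following procedures
do you recommend for estimating $\bar{Y}/\bar{X}$? (a) Always use $\bar{y}/\bar{X}$. (b)
Sometimes use $\bar{y}/\bar{X}$ and sometimes $\bar{y}/\bar{x}$. (c) Always use $\bar{y}/\bar{x}$. Give
reasons for your answer.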
6.6 The following data are for a small artificial population with $N = 8$ and
two strata of equal size:

Stratum 1              Stratum 2
$x_{1i}$    $y_{1i}$   $x_{2i}$    $y_{2i}$
2           0          10          7
5           3          18          15
9           7          21          10
15          10         25          16

For a stratified random sample in which $n_1 = n_2 = 2$, compare the MSE's of
$\hat{Y}_{Rs}$ and $\hat{Y}_{Rc}$ by working out the results for all possible samples. To what
extent is the difference in MSE's due to biases in the estimates?

6.7 In exercise 6.6 compute the variance given by using Lahiri's method of
sample selection within each stratum and a separate ratio estimate.

6.8 Forty-five states of the United States (excluding the 5 largest) were arranged in nine strata with five states each, states in the same stratum having roughly the same ratio of 1950 to 1940 population. A stratified random sample with $n_h = 2$ gave the following results for 1960 population (y) and 1950 population (x), in millions.

    Stratum      1     2     3     4     5     6     7     8     9
    $y_{h1}$   0.23  0.63  0.7   2.54  4.67  4.32  4.56  1.79  2.18
    $x_{h1}$   0.13  0.50  0.91  2.01  3.93  3.96  4.06  1.91  1.90
    $y_{h2}$   4.95  2.85  0.61  6.07  3.96  1.41  3.57  1.86  1.75
    $x_{h2}$   2.78  2.38  0.53  4.84  3.44  1.33  3.29  2.01  1.32

Given that the 1950 population total X is 97.94, estimate the 1960 population by the combined ratio estimate. Find the standard error of your estimate by Keyfitz' short-cut method (section 6.13). The correct 1960 total was 114.99. Does your estimate agree with this figure within sampling errors?

6.9 In the example of a bivariate ratio estimate given by Olkin, a sample of 50 cities was drawn from a population of 200 large cities. The variates $y$, $x_1$, $x_2$ are the numbers of inhabitants per city in 1950, 1940, and 1930, respectively. For the population, $\bar{Y} = 1699$, $\bar{X}_1 = 1482$, $\bar{X}_2 = 1420$ (in 100's) and, for the sample, $\bar{y} = 1896$, $\bar{x}_1 = 1693$, $\bar{x}_2 = 1643$. The C matrix as defined in section 6.17 is

               $y$      $x_1$    $x_2$
    $y$       1.213    1.241    1.256
    $x_1$     1.241    1.302    1.335
    $x_2$     1.256    1.335    1.381

Estimate $\bar{Y}$ by (a) the sample mean, (b) the ratio of 1950 to 1940 numbers of inhabitants, and (c) the bivariate ratio estimate. Compute the estimated standard error of each estimate.
6.10 Prove that with Midzuno's method of sample selection (section 6.15) the probability that any specific sample will be drawn is

$$\frac{(n-1)!\,(N-n)!}{(N-1)!}\;\frac{\sum_{i=1}^{n} x_i}{X}$$
CHAPTER 7


Regression Estimates


7.1 THE LINEAR REGRESSION ESTIMATE


Like the ratio estimate, the linear regression estimate is designed to increase precision by the use of an auxiliary variate $x_i$ which is correlated with $y_i$. When the relation between $y_i$ and $x_i$ is examined, it may be found that although the relation is approximately linear the line does not go through the origin. This suggests an estimate based on the linear regression of $y_i$ on $x_i$ rather than on the ratio of the two variables.

We suppose that $y_i$ and $x_i$ are each obtained for every unit in the sample and that the population mean $\bar{X}$ of the $x_i$ is known. The linear regression estimate of $\bar{Y}$, the population mean of the $y_i$, is

$$\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}) \qquad (7.1)$$

where the subscript $lr$ denotes linear regression and $b$ is an estimate of the change in y when x is increased by unity. The rationale of this estimate is that if $\bar{x}$ is below average we should expect $\bar{y}$ also to be below average by an amount $b(\bar{X} - \bar{x})$ because of the regression of $y_i$ on $x_i$. For an estimate of the population total Y, we take $\hat{Y}_{lr} = N\bar{y}_{lr}$.
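The estimate in (7.1) can be sketched in a few lines of Python. This is only an illustration: the function name, the sample values, and the default choice of the least squares slope for b are assumptions made here, not part of the text, which at this point leaves b unspecified.

```python
from statistics import mean

def regression_estimate(y, x, X_bar, b=None):
    """Linear regression estimate of the population mean, as in (7.1).

    If b is None, the sample least squares slope is used (an assumption of
    this sketch; it requires that the x values are not all equal).
    """
    y_bar, x_bar = mean(y), mean(x)
    if b is None:
        sxy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        b = sxy / sxx
    return y_bar + b * (X_bar - x_bar)

# Hypothetical sample of n = 4 units from a population of N = 50 units
# whose known mean of x is 3.0.
y = [3.0, 5.0, 7.0, 9.0]
x = [1.0, 2.0, 3.0, 4.0]
est_mean = regression_estimate(y, x, X_bar=3.0)   # estimate of the mean
est_total = 50 * est_mean                         # estimate of the total
```

Passing `b=0` returns the sample mean, and passing `b = mean(y) / mean(x)` returns the ratio estimate of the mean, so both appear as particular cases of the same function.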

Watson (1937) used a regression of leaf area on leaf weight to estimate the average area of the leaves on a plant. The procedure was to weigh all the leaves on the plant. For a small sample of leaves, the area and the weight of each leaf were determined. The sample mean leaf area was then adjusted by means of the regression on leaf weight. The point of the application is, of course, that the weight of a leaf can be found quickly but determination of its area is more time consuming.

This example illustrates a general situation in which regression estimates are helpful. Suppose that we can make a rapid estimate $x_i$ of some characteristic for every unit and can also, by some more costly method, determine the correct value $y_i$ of the characteristic for a simple random sample of the units. A rat expert might make a quick eye estimate of the number
To find the approximate variance, write

$$\bar{y}_R - \bar{Y} = \frac{\bar{y}}{\bar{x}}\,\bar{x}' - \bar{Y} = \left(\frac{\bar{y}}{\bar{x}}\,\bar{X} - \bar{Y}\right) + \frac{\bar{y}}{\bar{x}}(\bar{x}' - \bar{X}) = \frac{\bar{X}}{\bar{x}}(\bar{y} - R\bar{x}) + \frac{\bar{y}}{\bar{x}}(\bar{x}' - \bar{X})$$

The first component is the error of the ordinary ratio estimate (section 2.9). In obtaining the approximate error variance in section 2.9, we replaced the factor $\bar{X}/\bar{x}$ by unity in this term. To the same order of approximation, we replace the factor $\bar{y}/\bar{x}$ in the second component by the population ratio $R = \bar{Y}/\bar{X}$. Thus

$$\bar{y}_R - \bar{Y} \doteq (\bar{y} - R\bar{x}) + R(\bar{x}' - \bar{X}) \qquad (12.31)$$

If the first and second samples are drawn independently, we obtain

$$V(\bar{y}_R) \doteq \frac{S_y^2 - 2RS_{yx} + R^2S_x^2}{n} + \frac{R^2S_x^2}{n'} \qquad (12.32)$$

where the fpc terms are assumed negligible.

If the second sample is a random subsample of the first, rearrange (12.31) in the form

$$\bar{y}_R - \bar{Y} \doteq (\bar{y} - R\bar{x}) + R(\bar{x}' - \bar{X}) = (\bar{y} - \bar{Y}) + R(\bar{x}' - \bar{x})$$

It may be verified that, with the fpc ignored,

$$V(\bar{y} - \bar{Y}) = \frac{S_y^2}{n}$$

$$\text{cov}\,[(\bar{y} - \bar{Y}),\, R(\bar{x}' - \bar{x})] = -RS_{yx}\left(\frac{1}{n} - \frac{1}{n'}\right)$$

$$V[R(\bar{x}' - \bar{x})] = R^2S_x^2\left(\frac{1}{n} - \frac{1}{n'}\right)$$

Hence $V(\bar{y}_R)$ takes the form

$$V(\bar{y}_R) \doteq \frac{S_y^2 - 2RS_{yx} + R^2S_x^2}{n} + \frac{2RS_{yx} - R^2S_x^2}{n'} \qquad (12.33)$$

Note that formulas (12.32) and (12.33) are both of the form

$$V(\bar{y}_R) = \frac{V_n}{n} + \frac{V_{n'}}{n'}$$

Hence the optimum choices of $n$ and $n'$, and the minimum variance for comparison with single sampling, are found by the same procedure as for stratification and regression estimates.

For sample estimates of variance, the quantities $s_y^2$, $s_{yx}$, $s_x^2$, and $\hat{R}$ may be substituted in (12.32) and (12.33). The resulting estimates $v(\bar{y}_R)$ are not unbiased but appear to be adequate to the order of approximation presented in the analysis.


12.9 REPEATED SAMPLING OF THE SAME POPULATION


As confidence in sampling has increased, the practice of relying on 
samples for the collection of important series of data that are published 
at regular intervals has become common. In part, this is due to a realiza- 
tion that with a dynamic population a census at infrequent intervals is of 
limited use. Highly precise information about the characteristics of a 
population in July 1950 and July 1960 may not help much in planning 
that demands a knowledge of the population in 1964. A series of small 
samples at annual or even shorter intervals may be more serviceable. 

When the same population (apart from the changes that the passage 
of time introduces) is sampled repeatedly, the sampler is in an ideal 
position to make realistic estimates both of costs and of variances and to 
apply the techniques that lead to optimum efficiency of sampling. One 
important question is how frequently and in what manner the sample 
should be changed as time progresses. Many considerations affect the 
decision. People may be unwilling to give the same type of information 
time after time. The respondents may be influenced by information 
which they receive at the interviews, and this may make them progressively 
less representative as time proceeds. Sometimes, however, cooperation 
is better in a second interview than in the first, and when the information 
is technical or confidential the second visit may produce more accurate 
data than the first. 

The remainder of this chapter considers the question of replacement of 
the sample and the related question of making estimates from the series 
of repeated samples. The topic is appropriate to the present chapter 
because double sampling techniques can be utilized. 

Given the data from a series of samples, there are three kinds of quantity for which we may wish estimates:

1. The change in $\bar{Y}$ from one occasion to the next

2. The average value of $\bar{Y}$ over all occasions

3. The average value of $\bar{Y}$ for the most recent occasion
of rats in each block in a city area and then determine, by trapping, the actual number of rats in each of a simple random sample of the blocks. In another application described by Yates (1960), an eye estimate of the volume of timber was made on each of a population of 1/10-acre plots, and the actual timber volume was measured for a sample of the plots. The regression estimate

$$\bar{y} + b(\bar{X} - \bar{x})$$

adjusts the sample mean of the actual measurements by the regression of the actual measurements on the rapid estimates. The rapid estimates need not be free from bias. If $x_i - y_i = D$, so that the rapid estimate is perfect except for a constant bias D, then with $b = 1$ the regression estimate becomes

$$\bar{y} + (\bar{X} - \bar{x}) = \bar{X} + (\bar{y} - \bar{x})$$
= (pop. mean of rapid estimate) + (adjustment for bias)

Our knowledge of the [...] If $b$ is taken as zero, $\bar{y}_{lr}$ reduces to $\bar{y}$. If $b = \bar{y}/\bar{x}$,

$$\bar{y}_{lr} = \bar{y} + \frac{\bar{y}}{\bar{x}}(\bar{X} - \bar{x}) = \frac{\bar{y}}{\bar{x}}\,\bar{X} \qquad (7.2)$$

which is the ratio estimate of $\bar{Y}$.


7.2 REGRESSION ESTIMATES WITH PREASSIGNED b


Although, in most applications, b is estimated from the results of the sample, it is sometimes [...]
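Theorem 7.1. In simple random sampling, in which $b_0$ is a preassigned constant, the linear regression estimate

$$\bar{y}_{lr} = \bar{y} + b_0(\bar{X} - \bar{x})$$

is unbiased, with variance

$$V(\bar{y}_{lr}) = \frac{1-f}{n}\,\frac{\sum_{i=1}^{N}[(y_i - \bar{Y}) - b_0(x_i - \bar{X})]^2}{N-1} \qquad (7.3)$$

$$= \frac{1-f}{n}\,(S_y^2 - 2b_0S_{yx} + b_0^2S_x^2) \qquad (7.4)$$

Note that no assumption is required about the relation between y and x in the finite population.

Proof. Since $b_0$ is constant in repeated sampling,

$$E(\bar{y}_{lr}) = E(\bar{y}) + b_0E(\bar{X} - \bar{x}) = \bar{Y}$$

by theorem 2.1. Further, $\bar{y}_{lr}$ is the sample mean of the quantities $y_i - b_0(x_i - \bar{X})$, whose population mean is $\bar{Y}$. Hence by theorem 2.2

$$V(\bar{y}_{lr}) = \frac{1-f}{n}\,\frac{\sum_{i=1}^{N}[(y_i - \bar{Y}) - b_0(x_i - \bar{X})]^2}{N-1} = \frac{1-f}{n}\,(S_y^2 - 2b_0S_{yx} + b_0^2S_x^2)$$

Corollary. An unbiased sample estimate of $V(\bar{y}_{lr})$ is

$$v(\bar{y}_{lr}) = \frac{1-f}{n}\,\frac{\sum_{i=1}^{n}[(y_i - \bar{y}) - b_0(x_i - \bar{x})]^2}{n-1} = \frac{1-f}{n}\,(s_y^2 - 2b_0s_{yx} + b_0^2s_x^2)$$

This follows at once by applying theorem 2.4 to the variate $y_i - b_0(x_i - \bar{X})$.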
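Theorem 7.1 can be checked by brute force on a small population. The sketch below enumerates every simple random sample and compares the mean and variance of the estimates with $\bar{Y}$ and with (7.3) and (7.4). The population values, the sample size, and the preassigned value $b_0 = 3/4$ are hypothetical choices made only for illustration; exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of (y, x) pairs; not taken from the text.
pop = [(1, 1), (3, 2), (4, 4), (8, 5), (9, 8)]
N, n = len(pop), 2
b0 = Fraction(3, 4)                      # preassigned constant

Y_bar = Fraction(sum(y for y, _ in pop), N)
X_bar = Fraction(sum(x for _, x in pop), N)

# Regression estimate for every possible simple random sample of size n.
estimates = []
for s in combinations(pop, n):
    y_bar = Fraction(sum(y for y, _ in s), n)
    x_bar = Fraction(sum(x for _, x in s), n)
    estimates.append(y_bar + b0 * (X_bar - x_bar))

expected = sum(estimates) / len(estimates)
true_var = sum((e - Y_bar) ** 2 for e in estimates) / len(estimates)

# Formula (7.3): sum of squares of (y_i - Y_bar) - b0 (x_i - X_bar), divisor N - 1.
f = Fraction(n, N)
ss = sum(((y - Y_bar) - b0 * (x - X_bar)) ** 2 for y, x in pop)
var_73 = (1 - f) / n * ss / (N - 1)

# Formula (7.4): the same quantity written with S_y^2, S_yx, S_x^2.
S_y2 = sum((y - Y_bar) ** 2 for y, _ in pop) / (N - 1)
S_x2 = sum((x - X_bar) ** 2 for _, x in pop) / (N - 1)
S_yx = sum((y - Y_bar) * (x - X_bar) for y, x in pop) / (N - 1)
var_74 = (1 - f) / n * (S_y2 - 2 * b0 * S_yx + b0 ** 2 * S_x2)
```

Changing `b0` to any other constant should leave `expected` equal to `Y_bar`, since the theorem makes no assumption about the relation between y and x.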


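A natural question at this point is: what is the best value of $b_0$? The answer is given in theorem 7.2.

Theorem 7.2. The value of $b_0$ which minimizes $V(\bar{y}_{lr})$ is

$$B = \frac{S_{yx}}{S_x^2} = \frac{\sum_{i=1}^{N}(y_i - \bar{Y})(x_i - \bar{X})}{\sum_{i=1}^{N}(x_i - \bar{X})^2} \qquad (7.5)$$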
Finally, combining the bias in (7.23) and the variance in (7.21), the mean square error of $\bar{y}_{lr}$, to terms of order $1/n^2$, is

[equation illegible in source]

This result suggests that if the kurtosis in the distribution of x is moderate, the large-sample formula for the variance of the regression estimate should be adequate in samples of size 50 or more.

A further consequence is that if n is large, so that terms in $1/n^2$ are negligible, an inefficient estimate $b'$ of B can be used instead of the least squares estimate, since errors in b contribute only to terms in $1/n^2$. For instance, b might be computed from a subsample of the original data. Alternatively, if the units can be sorted conveniently into three equal-sized groups (low, middle, and high) according to the values of x, the estimate

$$b' = \frac{\bar{y}_{high} - \bar{y}_{low}}{\bar{x}_{high} - \bar{x}_{low}} \qquad (7.24)$$

has an efficiency of about 8/9 (Bartlett, 1949).
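The three-group estimate (7.24) is easy to compute once the units are sorted by x. The following is a minimal sketch: the function name and the data are hypothetical, and the rule that any remainder goes to the middle group when n is not divisible by 3 is an assumption of this sketch, not something the text specifies.

```python
def three_group_slope(x, y):
    """Slope estimate b' of (7.24) from the low and high thirds of the units."""
    pairs = sorted(zip(x, y), key=lambda p: p[0])   # order units by x
    m = len(pairs) // 3                             # size of the low and high groups
    low, high = pairs[:m], pairs[-m:]
    x_low = sum(p[0] for p in low) / m
    y_low = sum(p[1] for p in low) / m
    x_high = sum(p[0] for p in high) / m
    y_high = sum(p[1] for p in high) / m
    return (y_high - y_low) / (x_high - x_low)

# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1 + 2 * v for v in xs]
b_prime = three_group_slope(xs, ys)
```

Only the group means enter the calculation, which is why the estimate is convenient when sorting into groups is cheaper than a full least squares computation.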


7.5 FURTHER NOTES ON THE BIAS

In a finite population the leading term in the bias is obtained from (7.23) in section 7.4, except that in applying theorem 2.3 we do not let $N \to \infty$. The principal change is to bring in the fpc, giving

$$\text{bias in } \bar{y}_{lr} = -\frac{(1-f)}{(n-1)S_x^2}\left[\frac{\sum_{i=1}^{N} e_i(x_i - \bar{X})^2}{N-1}\right]$$

The preceding analysis showed [...] infinite population [...] leading term. A corresponding [...] of y on x is linear in the [...]. By a linear regression in a finite population, we mean that if

$$y_i = \bar{Y} + \beta(x_i - \bar{X}) + e_i \qquad (7.25)$$

then $E(e_i \mid x_i) = 0$ for every fixed $x_i$. This implies that if a particular value of x appears on only one unit in the population, the value of e for that unit must be zero.

Theorem 7.4. If the regression of $y_i$ on $x_i$ is linear, then

$$\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x})$$

where $b = \sum (y_i - \bar{y})(x_i - \bar{x}) / \sum (x_i - \bar{x})^2$ is the least squares regression of y on x, is an unbiased estimate of $\bar{Y}$. The result requires that $\sum (x_i - \bar{x})^2 > 0$ in every sample. (If $\nu_{max}$ is the largest number of units in the population that all have the same value of x, the condition $\sum (x_i - \bar{x})^2 > 0$ is satisfied for any $n > \nu_{max}$.)
Proof. If (7.25) holds with $E(e_i \mid x_i) = 0$, it is easy to show that $\beta = B$, the population regression of y on x as defined in (7.5). From (7.25),

$$\sum_{i=1}^{N} x_i(y_i - \bar{Y}) = \beta\sum_{i=1}^{N} x_i(x_i - \bar{X}) + \sum_{i=1}^{N} e_ix_i$$

Since $E(e_i \mid x_i) = 0$, $\sum_{i=1}^{N} e_ix_i = 0$, so that $\beta = B$. Hence, from (7.15) and (7.25),

$$\bar{y}_{lr} - \bar{Y} = \frac{(\bar{X} - \bar{x})\sum_{i=1}^{n} e_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} + \bar{e} \qquad (7.26)$$

Now consider samples with the same set of values of $x_i$. Over such samples $(\bar{X} - \bar{x})$ and $\sum (x_i - \bar{x})^2$ remain constant. Further, if any value of x occurs m times in the sample and $\nu$ times in the population, each of the $\nu$ units will occur equally often in samples of this type. It follows that the average values of $\sum e_i(x_i - \bar{x})$ and $\bar{e}$ over this set of samples are both zero. Thus $E(\bar{y}_{lr} - \bar{Y}) = 0$ over this set of samples. Hence $\bar{y}_{lr}$ is unbiased over all simple random samples of size n.
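A small enumeration makes theorem 7.4 concrete. The sketch below uses the five (y, x) pairs that appear in exercise 7.5, for which the mean of y at x = 0, 2, 3 is 4, 8, 10, so the regression is the line y = 4 + 2x and the single unit with x = 2 has e = 0. Here $\nu_{max} = 2$, so n = 3 satisfies the condition of the theorem. The function and variable names are illustrative assumptions; exact fractions avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

# (y, x) pairs; the regression of y on x is linear by construction.
pop = [(3, 0), (5, 0), (8, 2), (8, 3), (12, 3)]
N, n = len(pop), 3
Y_bar = Fraction(sum(y for y, _ in pop), N)
X_bar = Fraction(sum(x for _, x in pop), N)

def lr_estimate(sample):
    """Regression estimate with the least squares b computed from the sample."""
    y_bar = Fraction(sum(y for y, _ in sample), n)
    x_bar = Fraction(sum(x for _, x in sample), n)
    sxy = sum((y - y_bar) * (x - x_bar) for y, x in sample)
    sxx = sum((x - x_bar) ** 2 for _, x in sample)   # > 0 because n > nu_max
    b = sxy / sxx
    return y_bar + b * (X_bar - x_bar)

# Average of the estimate over all 10 simple random samples of size 3.
estimates = [lr_estimate(s) for s in combinations(pop, n)]
average = sum(estimates) / len(estimates)
```

Individual samples do not reproduce $\bar{Y}$ exactly; the unbiasedness shows up only in the average over samples sharing the same set of x values, which is the grouping used in the proof.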


7.6 COMPARISON WITH THE RATIO ESTIMATE AND THE MEAN PER UNIT


For these comparisons the sample size n must be large enough so that the approximate formulas for the variances of the ratio and regression estimates are valid. The three comparable variances for the estimated population mean $\bar{Y}$ are as follows:

$$V(\bar{y}_{lr}) = \frac{1-f}{n}\,S_y^2(1 - \rho^2) \qquad \text{(regression)}$$

$$V(\bar{y}_R) = \frac{1-f}{n}\,(S_y^2 + R^2S_x^2 - 2R\rho S_yS_x) \qquad \text{(ratio)}$$

$$V(\bar{y}) = \frac{1-f}{n}\,S_y^2 \qquad \text{(mean per unit)}$$

It is apparent that the variance of the regression estimate is smaller than that of the mean per unit unless $\rho = 0$, in which case the two variances are equal.

The variance of the regression estimate is less than that of the ratio estimate if

$$-\rho^2S_y^2 < R^2S_x^2 - 2R\rho S_yS_x$$

This is equivalent to the inequalities

$$(\rho S_y - RS_x)^2 > 0 \quad \text{or} \quad (B - R)^2 > 0$$

Thus the regression estimate is more precise than the ratio estimate unless $B = R$. This occurs when the relation between $y_i$ and $x_i$ is a straight line through the origin.


Example. The precision of the regression, ratio, and mean per unit estimates from a simple random sample can be compared by using data collected in the complete enumeration of peach orchards described on p. 174. In this example $y_i$ is the estimated peach production in an orchard and $x_i$ the number of peach trees in the orchard. We will compare the estimates of the total production of the 256 orchards, made from a sample of 100 orchards. It is doubtful whether the sample is large enough to make the variance formulas fully valid, since the cv's of $\bar{y}$ and $\bar{x}$ are both somewhat higher than 10%, but the example will serve to illustrate the computations. The basic data are as follows:

$S_y^2 = 6409$, $S_{yx} = 4434$, $S_x^2 = 3898$
$R = 1.270$, $\rho = 0.887$, $n = 100$, $N = 256$

$$V(\hat{Y}_{lr}) = \frac{N(N-n)}{n}\,S_y^2(1 - \rho^2) = \frac{(256)(156)}{100}\,(6409)(1 - 0.787) = 545,000$$

$$V(\hat{Y}_R) = \frac{N(N-n)}{n}\,(S_y^2 + R^2S_x^2 - 2RS_{yx}) = \frac{(256)(156)}{100}\,[6409 + (1.613)(3898) - 2(1.270)(4434)] = 573,000$$

$$V(\hat{Y}) = \frac{N(N-n)}{n}\,S_y^2 = 2,559,000$$

There is little to choose between the regression and the ratio estimates, as might be expected from the nature of the variables. Both techniques are greatly superior to the mean per unit.
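The arithmetic of the peach orchard example can be reproduced directly from the basic data. The sketch below uses only the quantities quoted in the example; the variable names are illustrative, and $\rho^2$ is computed from $S_{yx}$, $S_y^2$, and $S_x^2$ rather than from the rounded value 0.887.

```python
# Basic data quoted in the example.
N, n = 256, 100
S_y2, S_yx, S_x2 = 6409.0, 4434.0, 3898.0
R = 1.270

rho2 = S_yx ** 2 / (S_y2 * S_x2)        # squared correlation, about 0.787
factor = N * (N - n) / n                # multiplier for variances of totals

V_lr = factor * S_y2 * (1 - rho2)                          # regression
V_R = factor * (S_y2 + R ** 2 * S_x2 - 2 * R * S_yx)       # ratio
V_mean = factor * S_y2                                     # mean per unit

for name, v in [("regression", V_lr), ("ratio", V_R), ("mean per unit", V_mean)]:
    print(f"{name:>14}: {v:,.0f}")
```

To three significant figures the three variances agree with the values 545,000, 573,000, and 2,559,000 given in the example.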


7.7 REGRESSION ESTIMATES IN STRATIFIED SAMPLING

As with the ratio estimate, two types of regression estimate can be made in stratified random sampling. In the first estimate $\bar{y}_{lrs}$ (s for separate), a separate regression estimate is computed for each stratum mean, that is,

$$\bar{y}_{lrh} = \bar{y}_h + b_h(\bar{X}_h - \bar{x}_h) \qquad (7.27)$$

Then

$$\bar{y}_{lrs} = \sum_h W_h\bar{y}_{lrh} \qquad (7.28)$$

This estimate is appropriate when it is thought that the true regression coefficients $B_h$ vary from stratum to stratum.

The second regression estimate, $\bar{y}_{lrc}$ (c for combined), is appropriate when the $B_h$ are presumed to be the same in all strata. To compute $\bar{y}_{lrc}$ we first find

$$\bar{y}_{st} = \sum_h W_h\bar{y}_h, \qquad \bar{x}_{st} = \sum_h W_h\bar{x}_h$$

Then

$$\bar{y}_{lrc} = \bar{y}_{st} + b(\bar{X} - \bar{x}_{st}) \qquad (7.29)$$

The two estimates will be considered first in the case in which the $b_h$ and $b$ are chosen in advance, since their properties are unusually simple in this situation. From section 7.2, $\bar{y}_{lrh}$ is an unbiased estimate of $\bar{Y}_h$, so that $\bar{y}_{lrs}$ is an unbiased estimate of $\bar{Y}$. Further, since sampling is independent in different strata, it follows from theorem 7.1 that

$$V(\bar{y}_{lrs}) = \sum_h \frac{W_h^2(1 - f_h)}{n_h}\,(S_{yh}^2 - 2b_hS_{yxh} + b_h^2S_{xh}^2) \qquad (7.30)$$

Theorem 7.2 shows that $V(\bar{y}_{lrs})$ is minimized when $b_h = B_h$, the true regression coefficient in stratum h. The minimum value of the variance may be written

$$V_{min}(\bar{y}_{lrs}) = \sum_h \frac{W_h^2(1 - f_h)}{n_h}\left(S_{yh}^2 - \frac{S_{yxh}^2}{S_{xh}^2}\right) \qquad (7.31)$$

Turning to the combined estimate with preassigned b, (7.29) shows that $\bar{y}_{lrc}$ is also an unbiased estimate of $\bar{Y}$. Since $\bar{y}_{lrc}$ is the usual estimate from a stratified sample for the variate $y_{hi} + b(\bar{X} - x_{hi})$, we may apply theorem 5.3 to this variate, giving the result

$$V(\bar{y}_{lrc}) = \sum_h \frac{W_h^2(1 - f_h)}{n_h}\,(S_{yh}^2 - 2bS_{yxh} + b^2S_{xh}^2) \qquad (7.32)$$

The value of b which minimizes this variance is

$$B_c = \sum_h \frac{W_h^2(1 - f_h)S_{yxh}}{n_h} \bigg/ \sum_h \frac{W_h^2(1 - f_h)S_{xh}^2}{n_h} \qquad (7.33)$$

The quantity $B_c$ is a weighted mean of the stratum regression coefficients $B_h = S_{yxh}/S_{xh}^2$. If we write

$$a_h = \frac{W_h^2(1 - f_h)}{n_h}\,S_{xh}^2$$

then $B_c = \sum a_hB_h / \sum a_h$.

From (7.31) and (7.32), with $B_c$ in place of b, we find

$$V_{min}(\bar{y}_{lrc}) - V_{min}(\bar{y}_{lrs}) = \sum a_hB_h^2 - \left(\sum a_h\right)B_c^2 = \sum a_h(B_h - B_c)^2 \qquad (7.34)$$

This result shows that with the optimum choices the separate estimate has a smaller variance than the combined estimate unless $B_h$ is the same in all strata.

7.8 REGRESSION COEFFICIENTS ESTIMATED FROM THE SAMPLE


The preceding analysis is helpful in indicating the type of sample estimates $b_h$ and $b$ that may be efficient when used in regression estimates. With the separate estimate, the analysis suggests that we take

$$b_h = \frac{\sum_i (y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h)}{\sum_i (x_{hi} - \bar{x}_h)^2}$$

the within-stratum least squares estimate of $B_h$.

Applying theorem 7.3 to each stratum, we have

$$V(\bar{y}_{lrs}) = \sum_h \frac{W_h^2(1 - f_h)}{n_h}\,S_{yh}^2(1 - \rho_h^2) \qquad (7.35)$$

provided that the sample size $n_h$ is large in all strata. To obtain a sample estimate of variance, substitute

$$s_{y.xh}^2 = \frac{1}{n_h - 2}\left[\sum_i (y_{hi} - \bar{y}_h)^2 - b_h^2\sum_i (x_{hi} - \bar{x}_h)^2\right]$$

in place of $S_{yh}^2(1 - \rho_h^2)$ in (7.35).

The estimate $\bar{y}_{lrs}$ suffers from the same difficulty as the corresponding ratio estimate, in that the ratio of the bias to the standard error may become appreciable. It follows from section 7.5 that the regression estimates $\bar{y}_{lrh}$ in the individual strata may have biases of order $1/n_h$ and the biases may be of the same sign in all strata, so that the over-all bias in $\bar{y}_{lrs}$ may also be of order $1/n_h$. Since the leading term in the bias comes from the quadratic regression of $y_{hi}$ on $x_{hi}$, as shown in section 7.5, this danger is most acute when the relation between the variates approximates the quadratic rather than the linear type.

With the combined estimate, we saw that the variance is minimized when $b = B_c$ as defined in (7.33). This suggests that we take

$$b_c = \sum_h \frac{W_h^2(1 - f_h)}{n_h(n_h - 1)}\sum_i (y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h) \bigg/ \sum_h \frac{W_h^2(1 - f_h)}{n_h(n_h - 1)}\sum_i (x_{hi} - \bar{x}_h)^2$$

as a sample estimate of $B_c$. If the stratification is proportional and if we may replace the $(n_h - 1)$ in $b_c$ by $n_h$, $b_c$ reduces to the familiar pooled least squares estimate

$$b_c' = \sum_h\sum_i (y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h) \bigg/ \sum_h\sum_i (x_{hi} - \bar{x}_h)^2$$

In certain circumstances other estimates may be preferable to $b_c$ or $b_c'$. For instance, if the true regression coefficients $B_h$ are the same in all strata but the residual variances about the regression line differ substantially from one stratum to another, a different weighted mean of the $b_h$, weighting inversely as the estimated variance, may be more precise. However, the gain in precision as it affects $\bar{y}_{lrc}$ is likely to be small.

Since

$$\bar{y}_{lrc} - \bar{Y} = \bar{y}_{st} - \bar{Y} + b_c(\bar{X} - \bar{x}_{st}) = [\bar{y}_{st} - \bar{Y} + B_c(\bar{X} - \bar{x}_{st})] + (b_c - B_c)(\bar{X} - \bar{x}_{st})$$

it follows that if sampling errors of $b_c$ are negligible

$$V(\bar{y}_{lrc}) \doteq \sum_h \frac{W_h^2(1 - f_h)}{n_h}\,(S_{yh}^2 - 2B_cS_{yxh} + B_c^2S_{xh}^2)$$

Further examination of the structure of $b_c$ and $b_c'$ shows that they are in general biased. If stratification is proportional and the residual variance about the regression line is approximately the same in all strata, the resulting bias in $\bar{y}_{lrc}$ is of order $1/n$, as is the contribution of $V(b_c)$ to the variance of $\bar{y}_{lrc}$. If, however, the contribution of one stratum to the variance is predominating, say the hth stratum, the contribution of $V(b_c)$ to $V(\bar{y}_{lrc})$ may be as large as $1/n_h$.

As an estimate of $V(\bar{y}_{lrc})$, we may take

$$v(\bar{y}_{lrc}) = \sum_h \frac{W_h^2(1 - f_h)}{n_h(n_h - 1)}\sum_i [(y_{hi} - \bar{y}_h) - b_c(x_{hi} - \bar{x}_h)]^2$$

The sum over i may, of course, be computed as

$$\sum_i (y_{hi} - \bar{y}_h)^2 - 2b_c\sum_i (y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h) + b_c^2\sum_i (x_{hi} - \bar{x}_h)^2$$

7.9 COMPARISON OF THE TWO TYPES OF REGRESSION ESTIMATE


Hard and fast rules cannot be given to decide whether the separate or the combined estimate is better in any specific situation: some exercise of judgment is required in making a choice. The defects of the separate estimate are that it is more liable to bias when samples are small within the individual strata and that its variance has a larger contribution from sampling errors in the regression coefficients. The defect of the combined estimate is that its variance is inflated if the population regression coefficients differ from stratum to stratum.

If we are confident that the regressions are linear and if $B_h$ appears to be the same in all strata, so far as can be judged, the combined estimate is to be preferred. If the regressions appear linear (so that the danger of bias seems small) but $B_h$ seems to vary from stratum to stratum, the separate estimate is advisable. If there is some curvilinearity in the regressions when a linear regression estimate is used, the combined estimate is probably safer unless the samples are large in all strata.

The development of estimates of the regression type that are unbiased has been discussed by Mickey (1959), but these estimates have not yet been extensively tried.


EXERCISES

7.1 An experienced farmer makes an eye estimate of the weight of peaches $x_i$ on each tree in an orchard of N = 200 trees. He finds a total weight of X = 11,600 lb. The peaches are picked and weighed on a simple random sample of 10 trees, with the following results:

    Tree number         1   2   3   4   5   6   7   8   9  10   Total
    Actual wt. $y_i$   61  42  50  58  67  45  39  57  71  53    543
    Est. wt. $x_i$     59  47  52  60  67  48  44  58  76  58    569

As an estimate of the total actual weight Y, we take

$$\hat{Y} = N[\bar{y} + (\bar{X} - \bar{x})]$$

Compute the estimate and [...]

7.2 Does it appear that the linear regression estimate, [...] least squares b, would give a [...]

7.4 In exercise 7.3 find the estimated total number of inhabitants and its standard error if b is taken as 1.

7.5 In the following population with N = 5, verify (a) that the regression of y on x is linear and (b) that the linear regression estimate is unbiased in simple random samples with n = 3. The (y, x) pairs are (3, 0), (5, 0), (8, 2), (8, 3), (12, 3).
7.6 A rough measurement x, made on each unit, is related to the true measurement y on the unit by the equation

$$x = y + e + d$$

where d is a constant bias and e is an error of measurement, uncorrelated with y, which has mean zero and variance $S_e^2$ in the population, assumed infinite. In simple random samples of size n compare the variances of (a) the "difference" estimate $[\bar{y} + (\bar{X} - \bar{x})]$ of the mean $\bar{Y}$ and (b) the linear regression estimate, using the value of b which gives minimum variance. (The variances may depend on $S_y^2$.)

7.7 By working out all possible cases, compare the precisions of the separate and combined regression estimates of the total Y of the following population, when simple random samples of size 2 are drawn from each stratum:

        Stratum 1              Stratum 2
    $x_{1i}$   $y_{1i}$    $x_{2i}$   $y_{2i}$
        4         0            5         7
        6         3            6        12
        7         5            8        13

Use the ordinary least squares estimates of the B's, $b_h$ and $b_c'$ on p. 202.
CHAPTER 8


Systematic Sampling


8.1 DESCRIPTION


This method of sampling is at first sight quite different from simple random sampling. Suppose that the N units in the population are numbered 1 to N in some order. To select a sample of n units, we take a unit at random from the first k units and every kth unit thereafter. For instance, if k is 15 and if the first unit drawn is number 13, the subsequent units are numbers 28, 43, 58, and so on. The selection of the first unit determines the whole sample. This type is called an every kth systematic sample.
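The selection rule just described can be written as a short function. This is a minimal sketch: the function name and its arguments are illustrative assumptions, and the random start is drawn with Python's standard `random` module.

```python
import random

def systematic_sample(N, k, start=None, rng=random):
    """Unit numbers of an every-kth systematic sample from units 1..N.

    start is the first unit drawn; if omitted it is chosen at random
    from the first k units.
    """
    if start is None:
        start = rng.randint(1, k)        # random start among units 1..k
    return list(range(start, N + 1, k))

# The example in the text: k = 15 with first unit 13 (N = 100 is hypothetical).
units = systematic_sample(100, 15, start=13)

# Sample sizes for every possible start when N = 23 and k = 5.
sizes = [len(systematic_sample(23, 5, start=s)) for s in range(1, 6)]
```

The list `sizes` shows that when N is not a multiple of k the possible samples do not all contain the same number of units.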
The apparent advantages of this method over simple random sampling are as follows:

1. It is easier to draw a sample and often easier to execute without mistakes. This is a particular advantage when the drawing is done in the field. Even when drawing is done in an office there may be a substantial saving in time. For instance, if the units are described on cards that are all of the same size and lie in a file drawer, a card can be drawn out every inch along the file as measured by a ruler. This operation is speedy, whereas simple random sampling would be slow. Of course, this method departs slightly from the strict "every kth" rule.

2. [...] systematic sampling seems likely to be more precise than [...]. In effect, it stratifies the population into n strata, which consist of the first k units, [...]. The systematic sample is spread more evenly over the population, and this fact has [...] considerably more precise than stratified random sampling.
[Figure: positions of the sampled units along the unit-number axis, marked at k, 2k, 3k, 4k, 5k, 6k; x = systematic sample, o = stratified random sample.]

Fig. 8.1 Systematic and stratified random sampling.


One variant of the systematic sample is to choose each unit at or near the center of the stratum; that is, instead of starting the sequence by a random number chosen between 1 and k, we take the starting number as (k + 1)/2 if k is odd and either k/2 or (k + 2)/2 if k is even. This procedure carries the idea of systematic sampling to its logical conclusion. If $y_i$ can be considered a continuous function of a continuous variable i, there are grounds for expecting that this centrally located sample will be more precise than one randomly located. Little investigation of the efficacy of centrally located samples has been made for the types of population usually encountered in sample surveys, and attention will be confined to randomly located samples.

Since N is not in general an integral multiple of k, different systematic samples from the same finite population may vary by one unit in size. Thus, with N = 23, k = 5, the numbers of the units in the five systematic samples are shown in Table 8.1. The first three samples have n = 5 and the last two have n = 4. This fact introduces a disturbance into the theory of systematic sampling. The disturbance is probably negligible if n exceeds 50 and will be ignored, for simplicity, in the presentation of theory. It is unlikely to be large even when n is small.


TABLE 8.1
THE POSSIBLE SYSTEMATIC SAMPLES FOR N = 23, k = 5

    Systematic sample number
      I    II   III    IV     V
      1     2     3     4     5
      6     7     8     9    10
     11    12    13    14    15
     16    17    18    19    20
     21    22    23


8.2 RELATION TO CLUSTER SAMPLING


There is another way of looking at systematic sampling. With N = nk, the k possible systematic samples are shown in the columns of Table 8.2. It is evident from this table that the population has been divided into k large sampling units, each of which contains n of the original units. The operation of choosing a randomly located systematic sample is just the operation of choosing one of these large sampling units at random. Thus systematic sampling amounts to the selection of a single complex sampling unit which constitutes the whole sample. A systematic sample is a simple random sample of one cluster unit from a population of k cluster units.


TABLE 8.2
COMPOSITION OF THE k SYSTEMATIC SAMPLES

    Sample number
             1               2             ...         i             ...      k
           $y_1$           $y_2$           ...       $y_i$           ...    $y_k$
           $y_{k+1}$       $y_{k+2}$       ...       $y_{k+i}$       ...    $y_{2k}$
           ...             ...                       ...                    ...
           $y_{(n-1)k+1}$  $y_{(n-1)k+2}$  ...       $y_{(n-1)k+i}$  ...    $y_{nk}$

    Means  $\bar{y}_1$     $\bar{y}_2$     ...       $\bar{y}_i$     ...    $\bar{y}_k$

8.3 VARIANCE OF THE ESTIMATED MEAN

Several formulas have been developed for the variance of $\bar{y}_{sy}$, the mean of a systematic sample. The three given below apply to any kind of cluster sampling in which the clusters contain n elements and the sample consists of one cluster.

[...] is given to each of the first three samples [...] of the last two, the sample mean is unbiased [...]

[...] analysis the symbol $y_{ij}$ denotes the jth member of the ith [...], so that $j = 1, 2, \ldots, n$ and $i = 1, 2, \ldots, k$. The mean of the ith sample is denoted by $\bar{y}_{i.}$.
Theorem 8.1. The variance of the mean of a systematic sample is

$$V(\bar{y}_{sy}) = \frac{N-1}{N}\,S^2 - \frac{k(n-1)}{N}\,S_{wsy}^2 \qquad (8.1)$$

where

$$S_{wsy}^2 = \frac{1}{k(n-1)}\sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{i.})^2$$

is the variance among units that lie within the same systematic sample. The denominator of this variance, k(n − 1), is constructed by the usual rules in the analysis of variance: each of the k samples contributes (n − 1) degrees of freedom to the sum of squares in the numerator.

Proof. By the usual identity of the analysis of variance

$$(N-1)S^2 = \sum_i\sum_j (y_{ij} - \bar{Y})^2 = n\sum_i (\bar{y}_{i.} - \bar{Y})^2 + \sum_i\sum_j (y_{ij} - \bar{y}_{i.})^2$$

But the variance of $\bar{y}_{sy}$ is by definition

$$V(\bar{y}_{sy}) = \frac{1}{k}\sum_{i=1}^{k}(\bar{y}_{i.} - \bar{Y})^2$$

Hence

$$(N-1)S^2 = nkV(\bar{y}_{sy}) + k(n-1)S_{wsy}^2$$

The result follows.


Corollary. The mean of a systematic sample is more precise than the mean of a simple random sample if and only if

$$S_{wsy}^2 > S^2 \qquad (8.2)$$

Proof. If $\bar{y}$ is the mean of a simple random sample of size n,

$$V(\bar{y}) = \frac{N-n}{N}\,\frac{S^2}{n}$$

From (8.1), $V(\bar{y}_{sy}) < V(\bar{y})$ if and only if

$$\frac{N-1}{N}\,S^2 - \frac{k(n-1)}{N}\,S_{wsy}^2 < \frac{N-n}{N}\,\frac{S^2}{n}$$

that is, if

$$k(n-1)S_{wsy}^2 > \left(N - 1 - \frac{N-n}{n}\right)S^2 = k(n-1)S^2$$

This important result, which applies to cluster sampling in general, states that systematic sampling is more precise than simple random sampling if the variance within the systematic samples is larger than the population variance as a whole. Systematic sampling is precise when units within the same sample are heterogeneous and is imprecise when they are homogeneous. The result is obvious intuitively. If there is little variation within a systematic sample relative to that in the population, the successive units in the sample are repeating more or less the same information.

Another form for the variance is given in theorem 8.2.


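Theorem 8.2.

$$V(\bar{y}_{sy}) = \frac{S^2}{n}\left(\frac{N-1}{N}\right)[1 + (n-1)\rho_w] \qquad (8.3)$$

where $\rho_w$ is the correlation coefficient between pairs of units that are in the same systematic sample. It is defined as

$$\rho_w = \frac{E(y_{ij} - \bar{Y})(y_{iu} - \bar{Y})}{E(y_{ij} - \bar{Y})^2}$$

where the numerator is averaged over all kn(n − 1)/2 distinct pairs, and the denominator over all N values of $y_{ij}$. Since the denominator is $(N-1)S^2/N$, this gives

$$\rho_w = \frac{2}{(n-1)(N-1)S^2}\sum_{i=1}^{k}\sum_{j<u}(y_{ij} - \bar{Y})(y_{iu} - \bar{Y})$$

Proof.

$$n^2kV(\bar{y}_{sy}) = n^2\sum_{i=1}^{k}(\bar{y}_{i.} - \bar{Y})^2 = \sum_{i=1}^{k}[(y_{i1} - \bar{Y}) + (y_{i2} - \bar{Y}) + \cdots + (y_{in} - \bar{Y})]^2$$

The squared terms amount to the total sum of squares of deviations from $\bar{Y}$, that is, to $(N-1)S^2$. This gives

$$n^2kV(\bar{y}_{sy}) = (N-1)S^2 + 2\sum_{i=1}^{k}\sum_{j<u}(y_{ij} - \bar{Y})(y_{iu} - \bar{Y}) = (N-1)S^2 + (n-1)(N-1)S^2\rho_w$$

Hence

$$V(\bar{y}_{sy}) = \frac{S^2}{n}\left(\frac{N-1}{N}\right)[1 + (n-1)\rho_w]$$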


There is an analogue of theorem 8.2 which expresses $V(\bar{y}_{sy})$ in terms of the variance for a stratified random sample in which the strata are composed of the first k units, the second k units, and so on. In our notation the subscript j in $y_{ij}$ denotes the stratum. The stratum mean is written $\bar{y}_{.j}$.

Theorem 8.3.

$$V(\bar{y}_{sy}) = \frac{S_{wst}^2}{n}\left(\frac{N-n}{N}\right)[1 + (n-1)\rho_{wst}] \qquad (8.4)$$

where

$$S_{wst}^2 = \frac{1}{n(k-1)}\sum_{j=1}^{n}\sum_{i=1}^{k}(y_{ij} - \bar{y}_{.j})^2$$

This is the variance among units that lie in the same stratum. The divisor n(k − 1) is used because each of the n strata contributes (k − 1) degrees of freedom. Further

$$\rho_{wst} = \frac{E(y_{ij} - \bar{y}_{.j})(y_{iu} - \bar{y}_{.u})}{E(y_{ij} - \bar{y}_{.j})^2}$$

This quantity is the correlation between the deviations from the stratum means of pairs of items that are in the same systematic sample.

$$\rho_{wst} = \frac{2}{n(n-1)(k-1)}\sum_{i=1}^{k}\sum_{j<u}\frac{(y_{ij} - \bar{y}_{.j})(y_{iu} - \bar{y}_{.u})}{S_{wst}^2} \qquad (8.5)$$

The proof is similar to that of theorem 8.2.*

Corollary. A systematic sample has the same precision as the corresponding stratified random sample, with one unit per stratum, if $\rho_{wst} = 0$. This follows because for this type of stratified random sample $V(\bar{y}_{st})$ is (theorem 5.3, corollary 3)

$$V(\bar{y}_{st}) = \left(\frac{N-n}{N}\right)\frac{S_{wst}^2}{n}$$
Other formulas for V(J,,), appropriate to an autocorrelated population, 

have been given by W. G. and L. H. Madow (1944), who made the first 
theoretical Study of the precision of systematic sampling. 

Example. The data in Table 8.3 are for a small artificial population which exhibits a fairly steady rising trend. We have N = 40, k = 10, n = 4. Each column represents a systematic sample, and the rows are the strata. The example illustrates the situation in which the "within-stratum" correlation is positive. For instance, in the first sample each of the four numbers 0, 6, 18, and 26 lies below the mean of the stratum to which it belongs. This is consistently true, with a few exceptions, in the first five systematic samples. In the last five samples the deviations from the strata means are mostly positive. Thus the cross-product terms in $\rho_{wst}$ are predominantly positive. From theorem 8.3 we would expect systematic sampling to be less precise than stratified random sampling with one unit per stratum.


* In the first edition slightly different definitions of $\rho_w$ and $\rho_{wst}$ were used.


TABLE 8.3
DATA FOR 10 SYSTEMATIC SAMPLES WITH n = 4, N = kn = 40

                Systematic sample numbers                   Strata
Strata     1    2    3    4    5    6    7    8    9   10    means
I          0    1    1    2    5    4    7    7    8    6     4.1
II         6    8    9   10   13   12   15   16   16   17    12.2
III       18   19   20   20   24   23   25   28   29   27    23.3
IV        26   30   31   31   33   32   35   37   38   38    33.1
Totals    50   58   61   63   75   71   82   88   91   88    72.7


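The variance $V(\bar{y}_{sy})$ is found directly from the systematic sample totals as

$$V_{sy} = V(\bar{y}_{sy}) = \frac{1}{k}\sum_{i=1}^{k}(\bar{y}_{i.} - \bar{Y})^2 = \frac{1}{n^2 k}\sum_{i=1}^{k}(n\bar{y}_{i.} - n\bar{Y})^2$$

$$= \frac{1}{160}\left[(50)^2 + (58)^2 + \cdots + (88)^2 - \frac{(727)^2}{10}\right] = 11.63$$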


For random and stratified random sampling, we need an analysis of variance of the population into "between rows" and "within rows." This is presented in Table 8.4.


TABLE 8.4
ANALYSIS OF VARIANCE

                           df       ss        ms
Between rows (strata)       3    4828.3
Within strata              36     485.5     13.49 = S²_wst
Totals                     39    5313.8    136.25 = S²


Hence the variances of the estimated means from simple random and stratified random samples are as follows:

$$V_{ran} = \frac{N-n}{N}\cdot\frac{S^2}{n} = \frac{9}{10}\cdot\frac{136.25}{4} = 30.66$$

$$V_{st} = \frac{N-n}{N}\cdot\frac{S^2_{wst}}{n} = \frac{9}{10}\cdot\frac{13.49}{4} = 3.04$$

Both stratified random sampling and systematic sampling are much more effective than simple random sampling, but, as anticipated, systematic sampling is less precise than stratified random sampling.
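The comparison in this example can be reproduced from the Table 8.3 data. The following is a minimal sketch, with variable names chosen here for illustration; it rebuilds the sums of squares of Table 8.4 and the three variances $V_{ran}$, $V_{st}$, and $V_{sy}$.

```python
# Table 8.3: each inner list is one stratum (row); each column is a systematic sample.
ROWS = [
    [0, 1, 1, 2, 5, 4, 7, 7, 8, 6],
    [6, 8, 9, 10, 13, 12, 15, 16, 16, 17],
    [18, 19, 20, 20, 24, 23, 25, 28, 29, 27],
    [26, 30, 31, 31, 33, 32, 35, 37, 38, 38],
]
n, k = len(ROWS), len(ROWS[0])
N = n * k

values = [y for row in ROWS for y in row]
grand_mean = sum(values) / N

# Analysis of variance into "between strata" and "within strata".
total_ss = sum((y - grand_mean) ** 2 for y in values)
within_ss = sum((y - sum(row) / k) ** 2 for row in ROWS for y in row)
between_ss = total_ss - within_ss

S2 = total_ss / (N - 1)        # total mean square, 39 df
S2_wst = within_ss / (N - n)   # within-strata mean square, 36 df
fpc = (N - n) / N              # finite population correction

v_ran = fpc * S2 / n           # simple random sampling
v_st = fpc * S2_wst / n        # stratified, one unit per stratum

# Systematic sampling: variance of the k sample means, via the sample totals.
totals = [sum(col) for col in zip(*ROWS)]
v_sy = (sum(t * t for t in totals) - sum(totals) ** 2 / k) / (n * n * k)

print(f"V_ran = {v_ran:.2f}, V_st = {v_st:.2f}, V_sy = {v_sy:.2f}")
```

The unrounded within-strata mean square gives a value for $V_{st}$ that differs from 3.04 in the last digit, because the text works from the rounded mean square 13.49.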

Table 8.5 shows the same data, with the order of the observations reversed in the second and fourth strata. This has the effect of making $\rho_{wst}$ negative, because it makes the majority of the cross products between deviations from the strata means negative for pairs of observations that lie in the same systematic sample. In the first systematic sample, for instance, the deviations from the strata means are now -4.1, +4.8, -5.3, +4.9. Of the six products of pairs of deviations, four are negative. Roughly the same situation applies in every systematic sample.


TABLE 8.5
DATA IN TABLE 8.3, WITH THE ORDER REVERSED IN STRATA II AND IV

                Systematic sample numbers                   Strata
Strata     1    2    3    4    5    6    7    8    9   10    means
I          0    1    1    2    5    4    7    7    8    6     4.1
II        17   16   16   15   12   13   10    9    8    6    12.2
III       18   19   20   20   24   23   25   28   29   27    23.3
IV        38   38   37   35   32   33   31   31   30   26    33.1
Totals    73   74   74   72   73   73   73   75   75   65    72.7


This change does not affect $V_{ran}$ and $V_{st}$. With systematic sampling, it brings about a dramatic increase in precision, as is seen when the systematic sample totals in Table 8.5 are compared with those in Table 8.3. We now have

$$V_{sy} = \frac{1}{160}\left[(73)^2 + (74)^2 + \cdots + (65)^2 - \frac{(727)^2}{10}\right] = 0.46$$

It is sometimes possible to exploit this result by numbering the units to create negative correlations within strata. Accurate knowledge of the trends within the population is required. However, as will be seen later, the situation in Table 8.5 is one in which it is difficult to obtain from the sample a good estimate of the standard error of $\bar{y}_{sy}$.
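The effect of reversing strata II and IV can be verified with a short computation. This sketch, whose function names are illustrative only, builds the Table 8.5 arrangement from the Table 8.3 data and evaluates $V_{sy}$ and $\rho_{wst}$ for both arrangements.

```python
from itertools import combinations

ROWS_83 = [
    [0, 1, 1, 2, 5, 4, 7, 7, 8, 6],
    [6, 8, 9, 10, 13, 12, 15, 16, 16, 17],
    [18, 19, 20, 20, 24, 23, 25, 28, 29, 27],
    [26, 30, 31, 31, 33, 32, 35, 37, 38, 38],
]
# Table 8.5: reverse the order of the units within strata II and IV.
ROWS_85 = [row[::-1] if j in (1, 3) else row[:] for j, row in enumerate(ROWS_83)]


def v_sy(rows):
    """Variance of the systematic sample mean, from the sample totals."""
    n, k = len(rows), len(rows[0])
    totals = [sum(col) for col in zip(*rows)]
    return (sum(t * t for t in totals) - sum(totals) ** 2 / k) / (n * n * k)


def rho_wst(rows):
    """Correlation between stratum deviations within a systematic sample (8.5)."""
    n, k = len(rows), len(rows[0])
    dev = [[y - sum(row) / k for y in row] for row in rows]
    s2_wst = sum(d * d for row in dev for d in row) / (n * (k - 1))
    cross = sum(
        dev[j][i] * dev[u][i]
        for i in range(k)
        for j, u in combinations(range(n), 2)
    )
    return 2 * cross / (n * (n - 1) * (k - 1) * s2_wst)


totals_85 = [sum(col) for col in zip(*ROWS_85)]
print(totals_85)
print(v_sy(ROWS_83), rho_wst(ROWS_83))
print(v_sy(ROWS_85), rho_wst(ROWS_85))
```

The sign of $\rho_{wst}$ changes from positive to negative between the two arrangements, which is what drives the drop in $V_{sy}$.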


8.4 COMPARISON OF SYSTEMATIC WITH STRATIFIED 
RANDOM SAMPLING 


The performance of systematic sampling in relation to that of stratified or simple random sampling is greatly dependent on the properties of the population. There are populations for which systematic sampling is extremely precise and others for which it is less precise than simple random sampling. For some populations and some values of n, $V(\bar{y}_{sy})$ may even increase when a larger sample is taken, a startling departure from good behavior. Thus it is difficult to give general advice about the situations in which systematic sampling is to be recommended. A knowledge of the structure of the population is necessary for its most effective use.

Two lines of research on this problem have been followed. One is to compare the different types of sampling on artificial populations in which $y_i$ is some simple function of i. The other is to make the comparisons for natural populations. Some of the principal results are presented in the succeeding sections.


8.5 POPULATIONS IN “RANDOM” ORDER 


Systematic sampling is sometimes used, for its convenience, in populations in which the numbering of the units is effectively random. This is so in sampling from a file arranged alphabetically by surnames, if the item that is being measured has no relation to the surname of the individual. There will then be no trend or stratification in $y_i$ as we proceed along the file and no correlation between neighboring values.

In this situation we would expect systematic sampling to be essentially equivalent to simple random sampling. For any single finite population, with given values of n and k, this is not exactly true, because $V_{sy}$, which is based on only k sample means, is erratic when k is small and may turn out to be either greater or smaller than $V_{ran}$. There are two results which show that on the average the two variances are equal.


Theorem 8.4. Consider all N! finite populations which are formed by the N! permutations of any set of numbers $y_1, y_2, \ldots, y_N$. Then, on the average over these finite populations,

$$E(V_{sy}) = V_{ran}$$

Note that $V_{ran}$ is the same for all permutations.

This result was proved by W. G. and L. H. Madow (1944). A second approach is to regard the finite population as drawn at random from an infinite superpopulation. The result then applies not to any single finite population (i.e., to any specific set of values $y_1, y_2, \ldots, y_N$) but to the average of all finite populations that can be drawn from the infinite population. This approach may appear at first sight to have little relation to practical applications, but this impression is erroneous. Any sampling method is used in practice on a series of finite populations. One way of describing the class of finite populations for which a given sampling method is efficient is to describe the infinite superpopulation from which such finite populations might have been drawn at random.

The symbol $\mathcal{E}$ denotes averages over all finite populations which can be drawn from this superpopulation.
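Theorem 8.4 can be checked by brute force on a very small population. The sketch below is an illustration only: the six values and the choice n = 2, k = 3 are arbitrary assumptions made here. It enumerates all N! = 720 orderings, computes $V_{sy}$ for each with exact fractions, and compares the average with $V_{ran}$.

```python
from fractions import Fraction
from itertools import permutations


def v_sy(pop, n, k):
    """Variance of the systematic sample mean for one ordering of the population."""
    grand_mean = Fraction(sum(pop), n * k)
    # Sample u consists of units u, u + k, u + 2k, ... (0-based start u).
    means = [Fraction(sum(pop[u::k]), n) for u in range(k)]
    return sum((m - grand_mean) ** 2 for m in means) / k


def v_ran(pop, n):
    """Variance of the mean of a simple random sample of size n."""
    N = len(pop)
    grand_mean = Fraction(sum(pop), N)
    S2 = sum((y - grand_mean) ** 2 for y in pop) / (N - 1)
    return Fraction(N - n, N) * S2 / n


values = (1, 2, 4, 7, 11, 16)   # arbitrary illustrative numbers
n, k = 2, 3

orderings = list(permutations(values))
mean_v_sy = sum(v_sy(p, n, k) for p in orderings) / len(orderings)

print(len(orderings), mean_v_sy, v_ran(values, n))
```

Individual orderings give values of $V_{sy}$ both above and below $V_{ran}$; only the average over all orderings coincides with it.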


Theorem 8.5. If the variates $y_i$ (i = 1, 2, ..., N) are drawn at random from a superpopulation in which

$$\mathcal{E}y_i = \mu, \qquad \mathcal{E}(y_i - \mu)(y_j - \mu) = 0 \quad (i \neq j), \qquad \mathcal{E}(y_i - \mu)^2 = \sigma_i^2$$

then

$$\mathcal{E}V_{sy} = \mathcal{E}V_{ran}$$

The crucial conditions are that all $y_i$ have the same mean $\mu$, that is, there is no trend, and that no linear correlation exists between the values $y_i$ and $y_j$ at two different points. The variance $\sigma_i^2$ may change from point to point in the series.

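Proof. For any specific finite population,

$$V_{ran} = \frac{N-n}{Nn}\cdot\frac{\sum_{i=1}^{N}(y_i - \bar{Y})^2}{N-1}$$

Now

$$\sum_{i=1}^{N}(y_i - \bar{Y})^2 = \sum_{i=1}^{N}\left[(y_i - \mu) - (\bar{Y} - \mu)\right]^2 = \sum_{i=1}^{N}(y_i - \mu)^2 - N(\bar{Y} - \mu)^2$$

Since $y_i$ and $y_j$ are uncorrelated ($i \neq j$),

$$\mathcal{E}(\bar{Y} - \mu)^2 = \frac{1}{N^2}\sum_{i=1}^{N}\sigma_i^2$$

Hence

$$\mathcal{E}V_{ran} = \frac{N-n}{Nn(N-1)}\left(\sum_{i=1}^{N}\sigma_i^2 - \frac{1}{N}\sum_{i=1}^{N}\sigma_i^2\right)$$

This gives

$$\mathcal{E}V_{ran} = \frac{N-n}{N^2 n}\sum_{i=1}^{N}\sigma_i^2$$

Turning to $V_{sy}$, let $\bar{y}_u$ denote the mean of the uth systematic sample. For any specific finite population,

$$V_{sy} = \frac{1}{k}\sum_{u=1}^{k}(\bar{y}_u - \bar{Y})^2 = \frac{1}{k}\left[\sum_{u=1}^{k}(\bar{y}_u - \mu)^2 - k(\bar{Y} - \mu)^2\right]$$

By the theorem for the variance of the mean of an uncorrelated sample from an infinite population, $\mathcal{E}(\bar{y}_u - \mu)^2$ is $1/n^2$ times the sum of the $\sigma_i^2$ over the n units in the uth sample. Since every unit belongs to exactly one systematic sample, summing over u gives

$$\mathcal{E}V_{sy} = \frac{1}{k}\left(\frac{1}{n^2}\sum_{i=1}^{N}\sigma_i^2 - \frac{k}{N^2}\sum_{i=1}^{N}\sigma_i^2\right) = \frac{N-n}{N^2 n}\sum_{i=1}^{N}\sigma_i^2 = \mathcal{E}V_{ran}$$

where the last step uses $N = nk$.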

8.6 POPULATIONS WITH LINEAR TREND 


If the population consists solely of a linear trend, as illustrated in Fig. 8.2, it is fairly easy to guess the nature of the results. From Fig. 8.2, it looks as if $V_{sy}$ and $V_{st}$ (with one unit per stratum) will both be smaller than $V_{ran}$. Further, $V_{sy}$ will be larger than $V_{st}$, for if the systematic sample is too low in one stratum it is too low in all strata, whereas stratified random sampling gives an opportunity for within-stratum errors to cancel.

To examine the effects mathematically, we may assume that $y_i = i$. We have

$$\sum_{i=1}^{N} i = \frac{N(N+1)}{2}, \qquad \sum_{i=1}^{N} i^2 = \frac{N(N+1)(2N+1)}{6}$$


Fig. 8.2 Systematic sampling in a population with linear trend (× = systematic sample, ○ = stratified random sample).


The population variance $S^2$ is given by

$$S^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} i^2 - N\bar{Y}^2\right) = \frac{1}{N-1}\left[\frac{N(N+1)(2N+1)}{6} - \frac{N(N+1)^2}{4}\right] = \frac{N(N+1)}{12} \qquad (8.6)$$
Hence the variance of the mean of a simple random sample is

$$V_{ran} = \frac{N-n}{N}\cdot\frac{S^2}{n} = \frac{n(k-1)}{N}\cdot\frac{N(N+1)}{12n} = \frac{(k-1)(N+1)}{12} \qquad (8.7)$$

To find the variance within strata, $S_w^2$, we need only replace N by k in (8.6). This gives

$$V_{st} = \frac{N-n}{N}\cdot\frac{S_w^2}{n} = \frac{n(k-1)}{nk}\cdot\frac{k(k+1)}{12n} = \frac{k^2-1}{12n} \qquad (8.8)$$

For systematic sampling, the mean of the second sample exceeds that of the first by 1; the mean of the third exceeds that of the second by 1, and so on. Thus the means $\bar{y}_u$ may be replaced by the numbers 1, 2, ..., k. Hence, by a further application of (8.6),

$$\sum_{u=1}^{k}(\bar{y}_u - \bar{Y})^2 = \frac{k(k^2-1)}{12}$$

This gives

$$V_{sy} = \frac{1}{k}\sum_{u=1}^{k}(\bar{y}_u - \bar{Y})^2 = \frac{k^2-1}{12} \qquad (8.9)$$

From the formulas (8.7), (8.8), and (8.9) we deduce, as anticipated,

$$V_{st} = \frac{k^2-1}{12n} \leq V_{sy} = \frac{k^2-1}{12} \leq V_{ran} = \frac{(k-1)(N+1)}{12}$$

Equality occurs only when n = 1. Thus, for removing the effect of a linear trend, suspected or unsuspected, the systematic sample is much more effective than the simple random sample but less effective than the stratified random sample.

The performance of systematic sampling in the presence of a linear trend can be improved in several ways. One is to use a centrally located sample. Another is to change the estimate from an unweighted to a weighted mean in which all internal members of the sample have weight unity (before division by n) but different weights are given to the first and last members. If the random number drawn between 1 and k is i, these weights are

$$1 \pm \frac{n(2i - k - 1)}{2(n-1)k}$$

the + sign being used for the first member, the - sign for the last. For any i, the two weights obviously add to 2. The reader may verify that if the population consists of a linear trend and N = nk the weighted sample mean gives the correct population mean. The performance of these end corrections has been examined by Yates (1948), to whom they are due.
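Formulas (8.7) to (8.9) and the end corrections can be confirmed by direct enumeration. The following sketch takes $y_i = i$ with exact fractions; the function names and the test values n = 4, k = 5 are assumptions made for illustration.

```python
from fractions import Fraction


def linear_trend_variances(n, k):
    """Return (V_st, V_sy, V_ran) by direct enumeration for y_i = i, N = nk."""
    N = n * k
    pop = list(range(1, N + 1))
    Ybar = Fraction(N + 1, 2)
    fpc = Fraction(N - n, N)

    S2 = sum((y - Ybar) ** 2 for y in pop) / (N - 1)
    v_ran = fpc * S2 / n

    # Strata are consecutive blocks of k units; pool the within-stratum squares.
    within_ss = Fraction(0)
    for j in range(n):
        block = pop[j * k:(j + 1) * k]
        block_mean = Fraction(sum(block), k)
        within_ss += sum((y - block_mean) ** 2 for y in block)
    v_st = fpc * (within_ss / (n * (k - 1))) / n

    means = [Fraction(sum(pop[u::k]), n) for u in range(k)]
    v_sy = sum((m - Ybar) ** 2 for m in means) / k
    return v_st, v_sy, v_ran


def end_corrected_mean(sample, i, k):
    """Weighted mean with end corrections; i is the 1-based random start."""
    n = len(sample)
    w = Fraction(n * (2 * i - k - 1), 2 * (n - 1) * k)
    weights = [Fraction(1)] * n
    weights[0] = 1 + w    # first member of the sample
    weights[-1] = 1 - w   # last member of the sample
    return sum(wt * y for wt, y in zip(weights, sample)) / n


n, k = 4, 5
N = n * k
v_st, v_sy, v_ran = linear_trend_variances(n, k)
print(v_st, v_sy, v_ran)

population = list(range(1, N + 1))
corrected = [end_corrected_mean(population[i - 1::k], i, k) for i in range(1, k + 1)]
print(corrected)
```

With end corrections every one of the k possible samples returns the population mean (N + 1)/2, so the linear trend contributes nothing to the sampling error.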


8.7 POPULATIONS WITH PERIODIC VARIATION 


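The least favorable case for the systematic sample arises when k is equal to the period of the variation or to an integral multiple of it. Every observation within the systematic sample is then the same, so that the sample is no more precise than a single observation taken at random from the population.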




three workers who are among the highest earners in the group. Similarly, a 
systematic sample of names from a city directory might contain too many 
heads of households, or too many children. If there is time to study the 
periodic structure, a systematic sample can usually be designed to capitalize 
on it. Failing this, a simple or stratified random sample is preferable when 
a periodic effect is suspected but not well known. 

In some natural populations quasi-periodic variation may be present 
that would be difficult to anticipate. L. H. Madow (1946) found evidence 
pointing this way in a bed of hardwood seedling stock in a rather small 
population (N = 420). Finney (1950) discussed a similar phenomenon in 
timber volume per strip in the Dehra Dun forest, although in a re-exami- 
nation of the data Milne (1959) suggested that the apparent periodicity 
might have been produced by the process of measurement. The effect of 
quasi-periodicity is that systematic sampling performs poorly at some 
values of n and particularly well for others. Whether this effect occurs 
frequently is not known. Matérn (1960) cites examples in which natural 
forces (e.g., tides) might produce a spatial periodic variation, but he is of 
the opinion that no clear case has been found in forest surveys. 


8.8 AUTOCORRELATED POPULATIONS 


With many natural populations, there is reason to expect that two 
observations y;, y; will be more nearly alike when i and j are close together 
in the series than when they are distant. This happens whenever natural 
forces induce a slow change as we proceed along the series. In a mathe- 
matical model for this effect we may suppose that y; and y; are positively 
correlated, the correlation between them being a function solely of their 
distance apart, i — j, and diminishing as this distance increases. Although 
this model is oversimplified, it may represent one of the salient features of
many natural populations.

In order to investigate whether this model does apply to a population, we can calculate the set of correlations $\rho_u$ for pairs of items that are u units apart and plot this correlation against u. This curve, or the function it represents, is called a correlogram. Even if the model is valid, the correlogram will not be a smooth function for any finite population because irregularities are introduced by the finite nature of the population. In a comparison of systematic with stratified random sampling for this model these irregularities make it difficult to derive results for any single finite population. The comparison can be made over the average of a whole series of finite populations, which are drawn at random from an infinite superpopulation to which the model applies. This technique has already been applied in theorem 8.5.

Thus we assume that the observations $y_i$ (i = 1, 2, ..., N) are drawn from a superpopulation in which


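$$\mathcal{E}(y_i) = \mu, \qquad \mathcal{E}(y_i - \mu)^2 = \sigma^2, \qquad \mathcal{E}(y_i - \mu)(y_{i+u} - \mu) = \rho_u\sigma^2 \qquad (8.10)$$

where

$$\rho_u \geq \rho_v \geq 0, \quad \text{whenever } u < v$$

The drawing of one set of $y_i$ from this superpopulation creates a single finite population of size N.

The average variance for systematic sampling is denoted by

$$\mathcal{E}V_{sy} = \mathcal{E}E(\bar{y}_{sy} - \bar{Y})^2$$

For this class of populations it is easy to show that stratified random sampling is superior to simple random sampling, but no general result can be established about systematic sampling. Within the class there are superpopulations in which systematic sampling is superior to stratified random sampling, but there are also superpopulations in which systematic sampling is inferior to simple random sampling.


Theorem 8.6. If, in addition to conditions (8.10), we have

$$\delta_i^2 = \rho_{i+1} + \rho_{i-1} - 2\rho_i \geq 0 \qquad [i = 2, 3, \ldots, (kn - 2)]$$

then

$$\mathcal{E}V_{sy} \leq \mathcal{E}V_{st} \leq \mathcal{E}V_{ran}$$

for any size of sample. Further, unless $\delta_i^2 = 0$, i = 2, 3, ..., (kn - 2),

$$\mathcal{E}V_{sy} < \mathcal{E}V_{st}$$

A proof has been given by Cochran (1946).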


A sketch of the argument for n = 2 illustrates the role played by the "concave upwards" condition. In the systematic sample the members of the pair are always k units apart. Hence

$$\mathcal{E}V(\bar{y}_{sy}) = \tfrac{1}{4}(\sigma^2 + \sigma^2 + 2\rho_k\sigma^2) = \tfrac{1}{2}\sigma^2(1 + \rho_k)$$

With the stratified sample, there are k possible positions for the unit drawn from each stratum, making $k^2$ combinations of positions. The numbers of combinations that are 1, 2, ..., (2k - 1) units apart are as follows.

Distance                  1    2    ...   (k - 1)    k    (k + 1)   ...   (2k - 1)    Total
Number of combinations    1    2    ...   (k - 1)    k    (k - 1)   ...   1           k²

Hence the average value of $V(\bar{y}_{st})$, taken over the $k^2$ combinations, may be written

$$\mathcal{E}V(\bar{y}_{st}) = \frac{\sigma^2}{2k^2}\left[\sum_{i=1}^{k-1} i(2 + \rho_i + \rho_{2k-i}) + k(1 + \rho_k)\right]$$

Similarly, $\mathcal{E}V(\bar{y}_{sy})$ may be expressed as

$$\mathcal{E}V(\bar{y}_{sy}) = \frac{\sigma^2}{2k^2}\left[\sum_{i=1}^{k-1} i(2 + 2\rho_k) + k(1 + \rho_k)\right]$$

Hence

$$\mathcal{E}V(\bar{y}_{st}) - \mathcal{E}V(\bar{y}_{sy}) = \frac{\sigma^2}{2k^2}\left[\sum_{i=1}^{k-1} i(\rho_i + \rho_{2k-i} - 2\rho_k)\right]$$

But if

$$\rho_{i-1} + \rho_{i+1} \geq 2\rho_i \qquad (i = 2, 3, \ldots)$$

it is easy to show that no term inside the brackets is negative. This completes the proof. In short, the average distance apart is k for both the systematic and the stratified sample, but on account of the concavity the stratified sample loses more in precision when the distance is less than k than it gains when the distance exceeds k.
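The n = 2 argument can be illustrated numerically. The sketch below assumes an exponential correlogram $\rho_u = e^{-\lambda u}$, one of the concave-upward forms; the values k = 5 and λ = 0.3 are arbitrary choices made here. It evaluates the bracketed formula for the stratified sample, checks it against a direct average over the $k^2$ pairs of positions, and compares the result with the systematic sample.

```python
import math


def average_variances(k, lam, sigma2=1.0):
    """Average variances for n = 2 under the assumed correlogram exp(-lam * u)."""

    def rho(u):
        return math.exp(-lam * u)

    # Systematic sample: the two units are always k apart.
    ev_sy = 0.5 * sigma2 * (1 + rho(k))

    # Stratified sample, closed form from the distance table.
    bracket = sum(i * (2 + rho(i) + rho(2 * k - i)) for i in range(1, k))
    bracket += k * (1 + rho(k))
    ev_st_formula = sigma2 / (2 * k * k) * bracket

    # Stratified sample, direct average over the k^2 pairs of positions:
    # unit a in stratum 1 and unit k + b in stratum 2 are k + b - a apart.
    ev_st_direct = sum(
        0.5 * sigma2 * (1 + rho(k + b - a))
        for a in range(1, k + 1)
        for b in range(1, k + 1)
    ) / (k * k)

    return ev_sy, ev_st_formula, ev_st_direct


ev_sy, ev_st_formula, ev_st_direct = average_variances(k=5, lam=0.3)
print(ev_sy, ev_st_formula, ev_st_direct)
```

For a correlogram that is not concave upward the inequality need not hold, which can be explored by substituting a different `rho` in the sketch.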

Quenouille (1949) has shown that the inequalities in theorem 8.6 remain valid when two of the conditions are relaxed so that

$$\mathcal{E}(y_i) = \mu_i, \qquad \mathcal{E}(y_i - \mu_i)^2 = \sigma_i^2$$

In this event each of the three average variances is increased by the same amount.

As far as practical applications are concerned, correlograms that are concave upward have been proposed by several writers as models for specific natural populations. The function $\rho_u = \tanh(u^{-3/5})$ was suggested by Fisher and Mackenzie (1922) for the correlation between the weekly rainfall at two weather stations which are a distance u apart; the function $\rho_u = e^{-\lambda u}$ by Osborne (1942) and Matérn (1947) for forestry and land use surveys; and the function $\rho_u = (l - u)/l$ by Wold (1938) for certain types of economic time series.


8.9 NATURAL POPULATIONS 


Investigations have been made on a variety of natural populations. The data are described in Table 8.6. The first three studies were made from maps. In the first study the finite population consists of 288 altitudes at successive distances of 0.1 mile in undulating country. In the next two the data are the fractions of the lengths of lines drawn on a cover-type map that lie in a certain type of cover (e.g., grass). These examples might be considered the closest to continuous variation in the mathematical sense. The next three studies are based on temperatures for 192 consecutive days: (a) 12 in. under the soil, (b) 4 in. under the soil, (c) in air. This trio represents a gradation in the direction of greater influence of erratic day-to-day changes in the weather compared with slow seasonal influences. The remaining studies deal with plant or tree yields in sequences that lie along a line. In the study on potatoes, which is typical of the group, the finite population consists of the total yields of 96 rows in a field. Since no exhaustive search of the literature has been made, further data may be available.


TABLE 8.6
NATURAL POPULATIONS USED IN STUDIES OF SYSTEMATIC SAMPLING

Reference                  N       Type of Data
Yates (1948), table 13     288     Altitudes read at intervals of 0.1 mile from ordnance survey map.
Osborne (1942)             *       Per cent of area in (a) cultivated land, (b) shrub, (c) grass, (d) woodland on parallel lines drawn on a cover-type map.
Osborne (1942)             *       Per cent of area in Douglas fir on parallel lines drawn on a cover-type map.
Yates (1948)               192     Soil temperature (12 in. under grass) for 192 consecutive days.
Yates (1948)               192     Soil temperature (4 in. under bare soil) for 192 days.
Yates (1948)               192     Air temperature for 192 days.
Yates (1948)               96      Yields of 96 rows of potatoes.
Finney (1948)              160     Volume of saleable timber per strip, 3 chains wide and of varying length (Mt. Stuart forest).
Finney (1948)              288     Volume of virgin timber per strip, 2.5 chains wide, 80 chains long (Black's Mountain forest).
Finney (1950)              292     Volume of timber per strip, 2 chains wide and of varying length (Dehra Dun forest).
Johnson (1943)             400†    Number of seedlings per 1-ft-bed-width in 4 beds of hardwood seedbed stock.
Johnson (1943)             400†    Number of seedlings per 1-ft-bed-width in 3 beds of coniferous seedbed stock.
Johnson (1943)             400†    Number of seedlings per 1-ft-bed-width in 6 beds of coniferous transplant stock.

* Theoretically, N is infinite, if lines that are infinitely thin can be envisaged.
† Approximately. The number varied from bed to bed.


In some of the studies $V_{sy}$ is compared with the variance $V_{st2}$ for a stratified random sample with strata of size 2k and two units per stratum. This comparison is of interest because an unbiased estimate of $V_{st2}$ can be obtained from the sample data. This cannot be done for $V_{st1}$ (with strata of size k and 1 unit per stratum) or for $V_{sy}$. Other writers report comparisons of $V_{sy}$ with both $V_{st1}$ and $V_{st2}$. The majority of the sources do not present comparisons with $V_{ran}$ in readily usable form, but it appears that in general $V_{st2}$ gave gains in precision over $V_{ran}$.

In the papers by Yates and Finney comparisons are given for a range of values of n and k within each finite population. In these cases the data in Table 8.7 are the geometric means of the variance ratios for the individual values of k. The other writers make computations for only one value of k per population but may give data for different items or for several populations of the same natural type. Here, again, geometric means of the variance ratios were taken.


TABLE 8.7
RELATIVE PRECISION OF SYSTEMATIC AND STRATIFIED RANDOM SAMPLING

                                               Relative Precision of
                                               Systematic to Stratified
Data                             Range of k    V_st1/V_sy    V_st2/V_sy
Altitudes                        2-20          2.99          5.68
Per cent area (4 cover types)    -             -             4.42
Per cent area (Douglas fir)      -             -             1.83
Soil temperature (12 in.)        2-24          2.42          4.23
Soil temperature (4 in.)         4-24          1.45          2.07
Air temperature                  4-24          1.26          1.65
Potatoes                         3-16          1.37          1.90
Timber volume (Mt. Stuart)       2-32          1.07          1.35
Timber volume (Black's Mt.)      2-24          1.19          1.44
Timber volume (Dehra Dun)        2-32          1.39          1.89
Hardwood seedlings               14            -             1.89
Coniferous seedlings             14-24         -             2.22
Coniferous transplant            12-22         -             0.93

Although the data are limited in extent, the results are impressive. In the studies that permit comparison with $V_{st1}$, systematic sampling shows a consistent gain in precision which, although modest, is worth having. The median of the ratios $V_{st1}/V_{sy}$ is 1.4. The gains in comparison with $V_{st2}$ are substantial, the median ratio being 1.9.

The internal trend of the results agrees with expectations, although not too much should be made of this in view of the small number of studies. The gains are largest for the types of data in which we would guess that variation would be nearest to continuous. The decline in $V_{st}/V_{sy}$ from soil to air temperatures would also be anticipated from this viewpoint. In the last three items (forest nursery data), the only one showing no gain is coniferous transplant stock, which is older and more uniform than seedling stock.
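The two medians quoted above can be recomputed from the ratios in Table 8.7. In this small sketch the lists are transcribed from the table, on the assumption that the single ratios reported for the two cover-type studies and for the forest nursery data belong to the $V_{st2}/V_{sy}$ column.

```python
from statistics import median

# Ratios V_st1 / V_sy: studies that allow comparison with one unit per stratum.
ratios_st1 = [2.99, 2.42, 1.45, 1.26, 1.37, 1.07, 1.19, 1.39]

# Ratios V_st2 / V_sy: all thirteen studies.
ratios_st2 = [5.68, 4.42, 1.83, 4.23, 2.07, 1.65, 1.90,
              1.35, 1.44, 1.89, 1.89, 2.22, 0.93]

median_st1 = median(ratios_st1)
median_st2 = median(ratios_st2)
print(round(median_st1, 1), round(median_st2, 1))
```

Only one of the thirteen ratios in the second list falls below 1, the coniferous transplant stock.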


8.10 ESTIMATION OF THE VARIANCE FROM A 
SINGLE SAMPLE 


From the results of a simple random sample with n > 1, we can calculate an unbiased estimate of the variance of the sample mean, the estimate being unbiased whatever the form of the population. Since a systematic sample can be regarded as a simple random sample with n = 1, this useful property does not hold for the systematic sample. As an illustration, consider the "sine curve" example. Let

$$y_i = m + a\sin\frac{\pi i}{2}$$

where k = 4 and i = 1, 2, ..., 4n. The successive observations in the population are

(m + a), m, (m - a), m, (m + a), m, (m - a), m, ...

If i = 1 is chosen as the first member, all members of the systematic sample have the value (m + a). For the other three possible choices of i, all members have the values m, (m - a), or m, respectively. Thus from a single sample we have no means of estimating the value of a. But the true sampling variance of the mean of the systematic sample is $a^2/2$. The illustration shows that it is impossible to construct an estimated variance that is unbiased if periodic variation is present.
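The sine curve illustration is easy to reproduce. In the sketch below the level m = 10, the amplitude a = 3, and the sample size n = 6 are arbitrary assumptions; the four values of the cycle are written out exactly instead of being computed from a floating-point sine.

```python
from fractions import Fraction

m, a = Fraction(10), Fraction(3)   # hypothetical level and amplitude
k, n = 4, 6

# y_i = m + a*sin(pi*i/2) for i = 1, 2, ... cycles through these four values.
cycle = [m + a, m, m - a, m]
population = [cycle[(i - 1) % 4] for i in range(1, k * n + 1)]

# The k possible systematic samples and their means.
samples = [population[start::k] for start in range(k)]
means = [sum(s) / n for s in samples]
grand_mean = sum(population) / (k * n)

# True variance of the systematic sample mean.
true_v_sy = sum((ybar - grand_mean) ** 2 for ybar in means) / k

# Sum of squares within each sample: all that a single sample can reveal.
within_sample_ss = [
    sum((y - ybar) ** 2 for y in s) for s, ybar in zip(samples, means)
]

print(true_v_sy, within_sample_ss)
```

Every within-sample sum of squares is zero, so any estimator built from the variation inside one sample returns zero, whereas the true variance is $a^2/2$.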

These results do not mean that nothing can be done. Excluding the case of periodic variation, we might know enough about the structure of the population to be able to develop a mathematical model that adequately represents the type of variation present. We might then be able to manufacture a formula for the estimated variance that is approximately unbiased for this model, although it may be badly biased for other models. The decision to use one of these models must rest on the judgment of the sampler.

Some simple models with their corresponding estimated variances are presented below. No proofs are given.

The simplest models apply to populations in which $y_i$ is composed of a trend plus a "random" component. Thus

$$y_i = \mu_i + e_i$$

where $\mu_i$ is some function of i. For the random component, we assume that there is a superpopulation in which

$$\mathcal{E}(e_i) = 0, \qquad \mathcal{E}(e_i^2) = \sigma_i^2, \qquad \mathcal{E}(e_i e_j) = 0 \quad (i \neq j)$$

A proposed formula $s^2_{sy}$ for the estimated variance is called unbiased if

$$\mathcal{E}E(s^2_{sy}) = \mathcal{E}V_{sy}$$

that is, if it is unbiased over all finite populations that can be drawn from the superpopulation.
Population in "Random" Order

$$\mu_i = \text{constant} \qquad (i = 1, 2, \ldots, N)$$

$$s^2_{sy1} = \frac{N-n}{Nn}\cdot\frac{\sum (y_i - \bar{y}_{sy})^2}{n-1}$$

This case applies when we are confident that the order is essentially random with respect to the items being measured. The variance formula is the same as that for a simple random sample and is unbiased if the model is correct.

Stratification Effects Only

$$\mu_i = \text{constant} \qquad (rk + 1 \leq i \leq rk + k)$$

$$s^2_{sy2} = \frac{N-n}{Nn}\cdot\frac{\sum (y_i - y_{i+k})^2}{2(n-1)}$$

In this case the mean is constant within each stratum of k units. The estimate $s^2_{sy2}$, which is based on the mean square successive difference, is not unbiased. It contains an unwanted contribution from the difference between $\mu$'s in neighboring strata, and the first and last strata carry too little weight in estimating the random component of the variance. With a reasonably large sample, this estimate would in general be too high, assuming that the model is correct.
Linear Trend

$$\mu_i = \mu + \beta i$$

$$s^2_{sy3} = \frac{N-n}{N}\cdot\frac{n'}{n^2}\cdot\frac{\sum (y_i - 2y_{i+k} + y_{i+2k})^2}{6(n-2)}$$

where the sum is taken over the successive sets of three members of the sample.

The estimate is based on successive quadratic terms in the sequence $y_i$. The sum of squares contains (n - 2) terms. With a linear trend we have seen (section 8.6) that the trend can be eliminated by the use of end corrections. The term $n'/n^2$ is the sum of squares of the weights in the weighted sample mean. Unless n is small, $n'/n^2$ can be replaced by the usual factor 1/n. Because the strata at the ends receive too little weight, the estimate is biased unless $\sigma_i^2$ is constant, but it should be satisfactory if n is large and the model is correct.
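The three estimators above are simple to compute from a single systematic sample listed in its order of selection. The sketch below is an illustration under stated assumptions: the function names are invented here, and in the third estimator the factor $n'/n^2$ is replaced by 1/n, which the text allows unless n is small.

```python
def s2_sy1(sample, N):
    """Random-order model: the usual simple random sampling formula."""
    n = len(sample)
    mean = sum(sample) / n
    return (N - n) / (N * n) * sum((y - mean) ** 2 for y in sample) / (n - 1)


def s2_sy2(sample, N):
    """Stratification effects only: mean square successive difference."""
    n = len(sample)
    ssd = sum((a - b) ** 2 for a, b in zip(sample, sample[1:]))
    return (N - n) / (N * n) * ssd / (2 * (n - 1))


def s2_sy3(sample, N):
    """Linear trend: successive second differences, with 1/n in place of n'/n^2."""
    n = len(sample)
    ssq = sum(
        (a - 2 * b + c) ** 2
        for a, b, c in zip(sample, sample[1:], sample[2:])
    )
    return (N - n) / (N * n) * ssq / (6 * (n - 2))


# A pure linear trend y_i = i with N = 40, k = 10 and random start 3.
N, k, start = 40, 10, 3
sample = list(range(start, N + 1, k))   # [3, 13, 23, 33]
print(s2_sy1(sample, N), s2_sy2(sample, N), s2_sy3(sample, N))
```

On a pure linear trend the second differences vanish, so the third estimator is zero; this matches the fact that the end-corrected mean has no sampling error for such a population, while the first two estimators pick up the trend as if it were random variation.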

If continuous variation of a more complex type is present, the preceding formulas may give poor results. In Table 8.8 the second and third formulas are applied to six forest nursery beds (Johnson, 1943). The quadratic formula is slightly better than that based on successive differences, but both give serious overestimates.


TABLE 8.8
VARIANCES OF SAMPLE MEAN NUMBERS OF SEEDLINGS (JOHNSON'S DATA)

                  Bed    Actual V_sy    s²_sy2    s²_sy3
Silver maple      1       0.91           2.8       2.5
                  2       0.74           3.6       2.9
American elm      1       4.8           28.4      12.6
                  2      15.5           22.6      18.6
White spruce      1       5.5           17.2      11.2
                  2       2.0           11.6       6.4
White pine        1       8.2           21.0      21.9

Various other formulas can be devised. Residuals from a fitted polynomial of higher degree may be effective if $\mu_i$ varies continuously and not too rapidly: tables have been provided by DeLury (1950) for this method.

Formulas developed from simple assumptions about the nature of the correlogram have been discussed by Osborne (1942), Cochran (1946), and Matérn (1947). Yates (1949) has investigated an estimate based on a quantity of the form

$$y_u - y_{u+k} + y_{u+2k} - y_{u+3k} + \cdots$$

The successive items in the sample are given alternately + and - signs. If this expression is taken over the whole sample, only 1 df is available. In order to provide more degrees of freedom, the sample data can be broken into parts, which Yates suggests might contain nine observations each. If we denote the successive observations in the systematic sample by $y_1'$, $y_2'$, etc., and give weight 1/2 to the first and last terms, we may write

$$d_1 = \left(\tfrac{1}{2}y_1' + y_3' + y_5' + y_7' + \tfrac{1}{2}y_9'\right) - \left(y_2' + y_4' + y_6' + y_8'\right)$$

The next difference, $d_2$, may start with $y_9'$, and so on. Then, for the estimated variance of $\bar{y}_{sy}$, we take

$$s^2_{sy4} = \frac{N-n}{Nn}\sum_{u=1}^{g}\frac{d_u^2}{7.5g}$$

The factor 7.5 is the sum of squares of the coefficients in any $d_u$, and g is the number of differences which the sample provides (g is approximately n/9). In the natural populations that Yates examined a formula of this type was superior to the formula $s^2_{sy2}$ based on successive differences, but it still overestimated the actual variance of $\bar{y}_{sy}$.

In conclusion, there is no dearth of formulas for the estimated variance, but all appear to have a limited range of applicability.
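The estimate based on alternating differences can be sketched as follows. This is an illustration under stated assumptions, not Yates's own procedure in detail: the function name is invented, each block of nine is taken to start on the last observation of the previous block (as the remark about $d_2$ suggests), and observations left over at the end of the sample are ignored.

```python
# Coefficients of one difference d_u: weight 1/2 at both ends, alternating signs.
COEFFS = [0.5, -1, 1, -1, 1, -1, 1, -1, 0.5]


def s2_sy4(sample, N):
    """Estimated variance of the systematic mean from blocks of nine observations."""
    n = len(sample)
    # Block starts 0, 8, 16, ...: consecutive blocks share one end observation.
    starts = range(0, n - 8, 8)
    d = [sum(c * y for c, y in zip(COEFFS, sample[s:s + 9])) for s in starts]
    g = len(d)
    if g == 0:
        raise ValueError("need at least nine observations")
    return (N - n) / (N * n) * sum(x * x for x in d) / (7.5 * g)


# Sum of squared coefficients, the divisor used in the formula.
coeff_ss = sum(c * c for c in COEFFS)

# A sample of 17 from a hypothetical population with a pure linear trend, k = 5.
linear_sample = [2 + 5 * j for j in range(17)]
print(coeff_ss, s2_sy4(linear_sample, N=85))
```

Because the coefficients sum to zero and are symmetric about the middle of the block, each difference removes both a constant level and a linear trend, so the estimate is zero for the linear sample.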


8.11 STRATIFIED SYSTEMATIC SAMPLING 


We have seen that if the units are ordered appropriately systematic sampling provides a kind of stratification with equal sampling fractions. If we stratify by some other criterion, we may draw a separate systematic sample within each stratum with starting points independently determined. This is suitable if separate estimates are wanted for each stratum or if unequal sampling fractions are to be used. This method will, of course, be more precise than stratified random sampling if systematic sampling within strata is more precise than simple random sampling within strata.

If $\bar{y}_{syh}$ is the mean of the systematic sample in stratum h, the estimate of the population mean $\bar{Y}$ and its variance are

$$\bar{y}_{stsy} = \sum W_h\bar{y}_{syh}, \qquad V(\bar{y}_{stsy}) = \sum W_h^2 V(\bar{y}_{syh})$$

With only a few strata, the problem of finding a sample estimate of this quantity amounts to that already discussed of finding a satisfactory sample estimate of $V(\bar{y}_{syh})$ in each stratum.

When the strata are more numerous, an estimate based on the method of "collapsed strata" (section 5A.11) may be preferable. From the results in that section, it follows that the estimate

$$v(\bar{y}_{stsy}) = \sum W_h^2(\bar{y}_{syh} - \bar{y}_{syj})^2$$

where the sum extends over the pairs of strata, is on the average an overestimate, even if periodic variation is present within the strata.
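The two formulas can be sketched in a few lines. The functions below are hypothetical helpers written for illustration; the collapsed-strata version assumes an even number of strata, pairs neighbors in the order given, and assumes the two strata in each pair have equal weights.

```python
def stratified_systematic_mean(weights, stratum_means):
    """Estimate of the population mean: sum of W_h times the stratum sample mean."""
    return sum(w * m for w, m in zip(weights, stratum_means))


def collapsed_strata_variance(weights, stratum_means):
    """Collapsed-strata variance estimate from pairs (0, 1), (2, 3), ...

    Assumes an even number of strata and equal weights within each pair.
    """
    if len(weights) % 2 != 0:
        raise ValueError("strata must pair off exactly")
    total = 0.0
    for h in range(0, len(weights), 2):
        diff = stratum_means[h] - stratum_means[h + 1]
        total += weights[h] ** 2 * diff ** 2
    return total


# Hypothetical example: four equal strata, one systematic sample in each.
weights = [0.25, 0.25, 0.25, 0.25]
stratum_means = [10.0, 12.0, 20.0, 17.0]

estimate = stratified_systematic_mean(weights, stratum_means)
variance = collapsed_strata_variance(weights, stratum_means)
print(estimate, variance)
```

The estimate tends to be too large because each squared difference also contains the squared difference between the true means of the two paired strata.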

An unbiased estimate of the error variance can be obtained if two systematic samples, with a different random start and an interval 2k, are drawn within each stratum, one df being provided by each stratum. There will be some loss in precision if systematic sampling is effective. If there are many strata, one systematic sample can be used in most of them, drawing two in a random subsample of strata for the purposes of estimating the error.


8.12 SYSTEMATIC SAMPLING IN TWO DIMENSIONS 


In sampling an area, the simplest extension of the one-dimensional systematic sample is the "square grid" pattern shown in Fig. 8.4a. The sample is completely determined by the choice of a pair of random numbers to fix the coordinates of the upper left unit. The performance of the square grid has been studied both on theoretical and natural populations. Matérn (1960) has investigated the best type of sample when the correlation between any two points in the area is a monotone decreasing concave upward function of their distance apart d. For correlograms like $e^{-\lambda d}$ the square grid does well, being superior to simple or stratified random sampling with one unit per stratum, although Matérn gives reasons for expecting that the best pattern for this situation is a triangular network in which the points lie at the vertices of equilateral triangles.

In 14 agricultural uniformity trials, Haynes (1948) found that the square grid had about the same precision as simple random sampling in two dimensions. Milne (1959) examined the central square grid, in which the point lies at the center of the square, in 50 uniformity trials. It performed better than simple random sampling and perhaps slightly better than stratified random sampling, although this difference was not statistically significant. These results suggest that, at least for data of this type, autocorrelation effects are weak. For estimating the area covered by forest or by water on a map, Matérn found the square grid superior to the random methods in two examples.

Fig. 8.4 Two types of two-dimensional systematic sample: (a) aligned, or 
"square grid," sample; (b) unaligned sample. 

Figure 8.4b shows an alternative systematic sample, called an unaligned 
sample. The coordinates of the upper left unit are selected first by a pair 
of random numbers. Two additional random numbers determine the horizontal 
coordinates of the remaining two units in the first column of strata. 
Another two are needed to fix the vertical coordinates of the remaining 
units in the first row of strata. The constant interval $k$ (equal to the 
sides of the squares) then fixes the locations of all points. 
Investigations by Quenouille (1949) and Das (1950) for simple 
two-dimensional correlograms indicate that the unaligned pattern will often 
be superior both to the square grid and to stratified random sampling. 
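The two selection rules can be sketched in code. This is only an illustration under stated assumptions: the area is taken to be a rectangle of `rows` by `cols` square strata of side `k`, units are addressed by integer coordinates, and the function names are hypothetical. In the unaligned sample the horizontal offset is drawn once for each row of strata and the vertical offset once for each column of strata, which for a 3 x 3 layout uses the pair of random numbers for the upper left unit plus two further numbers for each direction, as in the description above.

```python
import random

def aligned_sample(rows, cols, k, rng):
    """Square grid: one random start (x0, y0), repeated in every k-by-k stratum."""
    x0, y0 = rng.randrange(k), rng.randrange(k)
    return [(c * k + x0, r * k + y0) for r in range(rows) for c in range(cols)]

def unaligned_sample(rows, cols, k, rng):
    """Unaligned sample: the horizontal offset changes from one row of strata
    to the next, and the vertical offset from one column of strata to the next."""
    x_off = [rng.randrange(k) for _ in range(rows)]  # fixed in the first column of strata
    y_off = [rng.randrange(k) for _ in range(cols)]  # fixed in the first row of strata
    return [(c * k + x_off[r], r * k + y_off[c])
            for r in range(rows) for c in range(cols)]

rng = random.Random(1)          # seeded only to make the sketch repeatable
grid = aligned_sample(3, 3, 10, rng)
unaligned = unaligned_sample(3, 3, 10, rng)
```

Both samples place exactly one point in each stratum; they differ only in whether the within-stratum position is shared by all strata.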

Further evidence of the superiority of an unaligned sample is obtained 
from experience in experimental design, in which the latin square has been 
found a precise method for arranging treatments in a rectangular field. 
The 5 x 5 latin square in Fig. 8.5a may be regarded as a division of the field 
into five systematic samples, one for each letter. There is some evidence 
that this particular square, which is called the “knight’s move” latin square, 
is slightly more precise than a randomly chosen 5 x 5 square, probably 
because alignment is absent in the diagonals as well as in rows and columns. 
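A square with these properties can be generated by a simple cyclic rule. The sketch below assumes the rule "cell (i, j) holds letter (2i + j) mod 5"; this is one standard way to obtain a knight's move arrangement and may differ from the square printed in Fig. 8.5a by a relabeling or reflection. The function name is hypothetical.

```python
def knights_move_square(p=5, letters="ABCDE"):
    """Cell (i, j) holds letter (2*i + j) mod p. Checked here only for p = 5."""
    return [[letters[(2 * i + j) % p] for j in range(p)] for i in range(p)]

square = knights_move_square()
for row in square:
    print(" ".join(row))
```

Each letter moves two columns to the left from one row to the next, so no letter repeats in any row, column, or main diagonal.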

The principle of the latin square has been used by Homeyer and Black 
(1946) in sampling rectangular fields of oats. Each field contained 21 plots. 
The three possible systematic samples are denoted by the letters A, B, and 
C, respectively, in Fig. 8.5b. This arrangement, with one of the letters 
chosen at random in each field, gave an increase in precision of around 25% 
over stratified random sampling with rows as strata. The arrangement does 
not quite satisfy the latin square property because each letter appears three 
times in one column and twice in the other columns, but it approaches this 
property as nearly as possible. 


Fig. 8.5 Two systematic designs based on the latin square: (a) "knight's 
move" latin square; (b) systematic design for a 3 x 7 rectangular field. 


Yates (1960), who terms arrangements of this type lattice sampling, 
discusses their use in two- and three-dimensional sampling. In three 
dimensions each row, column, and vertical level can be represented in the 
sample by choosing $p$ units out of the $p^3$ in the population. With $p^2$ 
units in the sample, each of the $p^2$ combinations of levels of rows and 
columns, of rows and vertical heights, and of columns and vertical heights 
can be represented. Patterson (1954) has investigated the arrangements that 
provide an unbiased estimate of error. 
8.13 SUMMARY 


Systematic samples are convenient to draw and to execute. In most of 
the studies reported in this chapter, both on artificial and on natural 
populations, they compared favorably in precision with stratified random 
samples. Their disadvantages are that they may give poor precision when 
unsuspected periodicity is present and that no trustworthy method for 
estimating $V(\bar{y}_{sy})$ from the sample data is known. 

In the light of these results systematic sampling can safely be 
recommended in the following situations: 

1. Where the ordering of the population is essentially random or 
contains at most a mild stratification. Here systematic sampling is used for 
convenience, with little expectation of a gain in precision. Sample estimates 
of error that are reasonably unbiased are available (section 8.10). 

2. Where a stratification with numerous strata is employed and an 
independent systematic sample is drawn from each stratum. The effects of 
hidden periodicities tend to cancel out in this situation, and an estimate of 
error that is known to be an overestimate can be obtained (section 8.11). 
Alternatively, we can use half the number of strata and draw two systematic 
samples, with independent random starts, from each stratum. This method 
gives an unbiased estimate of error. 

3. For subsampling the units (Chapter 10). In this case it turns out that 
an unbiased estimate of the sampling error can be obtained in most 
practical situations. 


8.1 The data in the table are the numbers of seedlings for each foot of bed 
in a bed 200 ft long. 


8.2 A population of 360 households (numbered 1 to 360) in Baltimore is 
arranged alphabetically in a file by the surname of the head of the household. 
Households in which the head is nonwhite occur at the following numbers: 
..., 58, 68, 69, 82, 83, 85, 86, 89-94, 98, 99, 101, 
107-110, 114, 154, 156, 178, 223, 224, 296, 298-300, 302-304, 306-323, 325-331, 
333, 335-339, 341, 342. (The nonwhite households show some "clumping" 
because of an association between surname and color.) 

Compare the precision of a 1-in-8 systematic sample with a simple random 
sample of the same size for estimating the proportion of households in which the 
head is nonwhite. 

8.3 A neighborhood contains three compact communities, consisting, 
respectively, of people of Anglo-Saxon, Polish, and Italian descent. There is an 
up-to-date directory. In it the persons in a house are listed in the following 
order: husband, wife, children (by age), others. Houses are listed in order 
along streets. The average number of persons per house is five. 

The choice is between a systematic sample of every fifth person in the directory 
and a 20% simple random sample. For which of the following variables do you 
expect the systematic sample to be more precise? (a) Proportion of people of 
Polish descent, (b) proportion of males, (c) proportion of children. Give reasons. 

8.4 In a directory of 13 houses on a street the persons are listed as follows. 
M = male adult, F = female adult, m = male child, f = female child. 


[Table: members of households 1 to 13, coded M, F, m, f; the individual 
entries are not recoverable here.] 


Compare the variances given by a systematic sample of one in five persons and 
a 20% simple random sample for estimating (a) the proportion of males, (b) the 
proportion of children, (c) the proportion of persons living in professional 


8.5 In exercise 8.1 we might estimate $V(\bar{y}_{sy})$ by (a) regarding each 
systematic sample as a simple random sample, (b) pretending that each 1-in-20 
systematic sample is composed of two 1-in-40 systematic samples with a 
separate random start. For each method, compare the average of the estimated 
variances with the actual variance of $\bar{y}_{sy}$. 


8.6 In a population consisting of a linear trend (section 8.6) show that a 
systematic sample is less precise than a stratified random sample with strata of 
size $2k$ and two units per stratum if $n > (4k + 2)/(k + 1)$. 

8.7 A two-dimensional population with a linear trend may be represented by 
the relation 

$$ y_{ij} = i + j \qquad (i, j = 1, 2, \ldots, nk) $$

where $y_{ij}$ is the value in the $i$th row and $j$th column. The population 
contains $n^2k^2$ units. A systematic square grid sample is selected by drawing 
at random two independent starting coordinates $i_0$, $j_0$, each between 1 
and $k$. The sample, of size $n^2$, contains all units whose coordinates are of 
the form 

$$ i_0 + \gamma k, \qquad j_0 + \delta k $$

where $\gamma$, $\delta$ are any two integers between 0 and $(n - 1)$, 
inclusive. Show that the mean of this sample has the same precision as the 
mean of a simple random sample of size $n^2$. 
8.8 If the comparison in exercise 8.7 were made for a three-dimensional 
population with linear trend, what result would you expect? 
CHAPTER 9 


Single-Stage Cluster Sampling 


9.1 REASONS FOR CLUSTER SAMPLING 


Several references have been made in preceding chapters to surveys in 
which the sampling unit consists of a group or cluster of smaller units that 
we have called elements. There are two main reasons for the widespread 
application of cluster sampling. Although the first intention may be to use 
the elements as sampling units, it is found in many surveys that no reliable 
list of the elements in the population is available and that it would be 
prohibitively expensive to construct such a list. In many countries there 
are no complete and up-to-date lists of the people, the houses, or the farms 
in any large geographic region. From maps of the region, however, it can 
be divided into areal units such as blocks in the cities and segments of 
land with readily identifiable boundaries in the rural parts. In the United 
... constructing a list of sampling units. 

Even when a list of ... considerations may point to the choice of a larger 
cluster unit. For a given size of sample, a small unit ... For example, a 
simple ... evenly than 20 city blocks ... greater field costs are incurred in 
locating ... familiar principle of selecting the ... in a calculation of 
costs. ... suggest that a small unit ... about the exact boundaries of the 
unit. Homeyer and Black (1946) found that units 2 x 2 ft gave yields of 
oats about 8% higher than units 3 x 3 ft, possibly because samplers tend to 
place boundary plants inside the unit when there is doubt. Sukhatme (1947) 
cites similar results for wheat and rice. 

9.2 A SIMPLE RULE 


When the problem is to compare a few specific sizes or types of unit, the 
following result is helpful. 


Theorem 9.1. This applies to simple random sampling in which the fpc 
is negligible. The quantity to be estimated is the population total. For 
the $u$th type of unit, let 

$M_u$ = relative size of unit 

$S_u^2$ = variance among the unit totals 

$C_u$ = relative cost of measuring one unit 

Then relative cost for specified precision or relative variance for specified 
cost $\propto C_u S_u^2 / M_u^2$. 

Proof. Let $V_u$ be the variance of the estimated population total as given 
by the $u$th type of unit, when $n_u$ of the $N_u$ units are sampled. Then 

$$ V(\hat{Y}) = V_u = \frac{N_u^2 S_u^2}{n_u} $$

The cost of taking these units is $C_u n_u$. Now the relative cost for a 
specified variance and the relative variance for a specified cost are both 
proportional to 

$$ C_u n_u V_u = C_u N_u^2 S_u^2 \propto \frac{C_u S_u^2}{M_u^2} $$

since $N_u M_u$ = constant for different units. This completes the proof. 


Corollary 1. If we define the relative net precision of a unit as inversely 
proportional to the variance obtained for fixed cost, theorem 9.1 may be 
stated as 

$$ \text{relative net precision} \propto \frac{M_u^2}{C_u S_u^2} \tag{9.1} $$


Corollary 2. In the analysis of variance, the variances for units of 
different sizes are often computed on what is called a common basis, 
usually that applicable to the smallest unit. To put the variances on a 
common basis, the variance $S_u^2$ among totals of units of size $M_u$ is 
divided by $M_u$. Let 

$S_u'^2 = S_u^2/M_u$ = variance among unit totals (on a common basis) 

$C_u' = C_u/M_u$ = relative cost of taking a given bulk of sample 

Then theorem 9.1 and corollary 1 may be stated as follows: 

$$ \text{relative cost for equal precision} \propto \frac{C_u S_u^2}{M_u^2} = C_u' S_u'^2 $$

$$ \text{relative net precision} \propto \frac{1}{C_u' S_u'^2} \tag{9.2} $$

This result shows that if differences in the costs of taking the sample are 
ignored (i.e., assuming that $C_u'$ is constant) the relative net precision with 
the $u$th unit $\propto 1/S_u'^2$. In order to compare different units for the 
same total bulk of sample, the relevant quantities are the variances among 
units, expressed on a common basis. 


Example. Johnson's data (1941) for a bed of white pine seedlings provide a 
simple example. The bed contained six rows, each 434 ft long. There are many 
ways in which the bed can be divided into sampling units. Data for four types 
of unit are shown in Table 9.1. Since the bed was completely counted, the data 
are correct population values. 


TABLE 9.1 
DATA FOR FOUR TYPES OF SAMPLING UNIT 

                                              Type of Unit 
                                     1-ft     2-ft     1-ft     2-ft 
Preliminary Data                     row      row      bed      bed 
$M_u$ = relative size of unit          1        2        6       12 
$N_u$ = number of units in pop.     2604     1302      434      217 
$S_u^2$ = pop. variance per unit    2.537    6.746   23.094   68.558 
Number of feet of row that can 
  be counted in 15 min                44       62       78      108 


The units were 

1-ft row: one foot of a single row 
2-ft row: two feet of a single row 
1-ft bed: one foot of the width of the bed 
2-ft bed: two feet of the width of the bed 
With the first two units it was assumed that sampling would be stratified by 
rows, so that the $S_u^2$ represent variances within rows. Simple random 
sampling was assumed for the last two units. 

Since the principal cost is that of locating and counting the units, costs were 
estimated by a time study (last row of Table 9.1). With the larger units, a greater 
bulk of sample can be counted in 15 min, less time being spent in moving from 
one unit to another. 

The quantity to be estimated is the total number of seedlings in the bed. In 
the notation of theorem 9.1, Table 9.1 gives the values of $M_u$ and $S_u^2$. 
The relative values of $C_u$, expressed as the time required to count one unit, 
are as follows. 

                              1-ft     2-ft     1-ft     2-ft 
                              row      row      bed      bed 
$C_u$ (in 15-min times)       1/44     2/62     6/78    12/108 


By theorem 9.1, corollary 1, the relative net precisions are worked out in 
Table 9.2. 

The last line of Table 9.2 gives the relative precisions when that of the smallest 
unit is taken as 100. The 1-ft bed appears to be the best unit. 


TABLE 9.2 
RELATIVE NET PRECISIONS OF THE FOUR UNITS 

                        1-ft row    2-ft row         1-ft bed          2-ft bed 

$M_u^2/(C_u S_u^2)$     44/2.537    (4)(62)/         (36)(78)/         (144)(108)/ 
                                    [(2)(6.746)]     [(6)(23.094)]     [(12)(68.558)] 
                        = 17.34     = 18.38          = 20.27           = 18.90 
Relative precision      100         106              117               109 


The variances among units, expressed on a common basis, are also worth 
looking at. The values of $S_u'^2 = S_u^2/M_u$, applicable to a single foot of 
row, are, respectively, 2.537, 3.373, 3.849, 5.713. Note that these variances 
increase steadily with increasing size of unit. This result is commonly found 
(although exceptions may occur). Since the relative net precision 
$\propto 1/(C_u' S_u'^2)$, the cost of taking a given bulk of sample must 
decrease with the larger units if they are to prove economical. 
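The arithmetic of Tables 9.1 and 9.2 can be retraced in a few lines. The sketch below takes the figures from Table 9.1 and applies corollary 1; the variable names are illustrative only, and $C_u$ is taken, as in the text, to be the unit size in feet of row divided by the feet of row countable in 15 min.

```python
# (M_u, S_u^2, feet of row countable in 15 min), from Table 9.1
units = {
    "1-ft row": (1, 2.537, 44),
    "2-ft row": (2, 6.746, 62),
    "1-ft bed": (6, 23.094, 78),
    "2-ft bed": (12, 68.558, 108),
}

net_precision = {}
common_basis_variance = {}
for name, (M, S2, feet_per_15min) in units.items():
    C = M / feet_per_15min                     # time to count one unit (15-min units)
    net_precision[name] = M ** 2 / (C * S2)    # corollary 1, formula (9.1)
    common_basis_variance[name] = S2 / M       # S_u'^2 of corollary 2

base = net_precision["1-ft row"]
relative = {name: round(100 * v / base) for name, v in net_precision.items()}

for name in units:
    print(f"{name}: {net_precision[name]:.2f}  relative {relative[name]}"
          f"  common-basis variance {common_basis_variance[name]:.3f}")
```

The printed relative precisions are those in the last line of Table 9.2, and the common-basis variances are the four values quoted above.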


Theorem 9.1 and its corollaries remain valid for stratified sampling with 
proportional allocation if all strata are of the same size and if the $S_u^2$ 
represent average variances within strata. This is so, under the conditions 
stated, because the variance of the estimated population total, ignoring the 
fpc, is $N_u^2 S_u^2/n_u$, and therefore assumes the same form as in simple 
random sampling. Theorem 9.1 does not hold for more complex types of sampling. 

The preceding results are intended merely as an illustration of the 
general procedure. Comparisons among units should always be made for 
the kind of sampling that is to be used in practice or, if this has not been 
decided, for the kinds that are under consideration. Changes in the method 
of sampling or of estimation will alter the relative net precisions of the 
different units. Even with a fixed method of sampling and estimation, 
relative net precisions vary with size of sample if the cost is not a linear 
function of size or if the size is large enough so that the fpc must be taken 
into account. 

There is usually more than one item to consider. One approach is to fix 
the total cost and work out the relative net precisions for each type of unit 


TABLE 9.3 
ESTIMATED STANDARD ERRORS (%) FOR FOUR SIZES OF UNIT, WITH SIMPLE 
RANDOM SAMPLING 

                                                              Best 
Items                             S/4     S/2     S      2S   Unit 
Number of swine                   ...     ...     ...    ...   S/2 
Number of horses                  ...     ...     ...    ...   ... 
Number of sheep                   ...     ...     ...    ...   ... 
Number of chickens                ...     ...     ...    ...   ... 
Number of eggs yesterday          ...     ...     ...    ...   ... 
Number of cattle                  ...     ...     ...    ...   S/2 
Number of cows milked             ...     ...     ...    ...   S/2 
Number of gallons of milk         4.4     4.2     4.4    4.9   S/2 
Dairy products receipts           ...     ...     ...    ...   ... 
Number of farm acres              2.9     2.8     3.0    3.5   S/2 
Number of corn acres              ...     ...     ...    ...   S/2 
Number of oat acres               ...     ...     ...    ...   ... 
Corn yield                        ...     ...     ...    ...   S/4 
Oat yield                         1.6     1.5     1.6    1.8   S/2 
Commercial feed expenditures     12.6    13.6    16.7   21.8   S/4 
Total expenditures, operator      ...     8.1     9.6   12.0   ... 
Total receipts, operator          ...     ...     ...    ...   ... 
Net cash income, operator         6.8     6.9     7.8    9.5   S/4 
slightly under four farms. In this comparison the total field cost ($1000), 
the length of questionnaire (60 min to complete), and the travel cost (5 cents 
per mile) are all specified, because relative net precisions change if any of 
these variables is altered. Costs are at a 1939 level. 

The data in the table are the relative standard errors (in per cent) of the 
estimated means per farm for 18 items. No unit is best for all items. The 
half-section and the quarter-section are, however, superior to the larger 
units for all except two items, with little to choose between the half- and 
quarter-sections. The half-section would probably be preferred, because 
the problem of identifying the boundaries accurately is easier. 


9.3 COMPARISONS OF PRECISION MADE FROM 
SURVEY DATA 


In the nursery seedling example the variances for the different types of 
unit were obtained from a complete count of the population. Except with 
small populations, however, it is seldom feasible to conduct a survey solely 
for the purpose of comparison. Information about the optimum unit is 
more usually procured as an ingenious by-product of a survey whose main 
purpose is to make estimates. 

Suppose that in a survey each unit can be divided into M smaller units. 
Instead of recording only the totals for each “large” unit in the sample, we 
record data separately for each of the M small units. A comparison can 
then be made of the precision of the large and small units. A simple random 
sample of size n will be assumed at first. 

The analysis of variance in Table 9.4 can be computed from the sample. 


TABLE 9.4 
ANALYSIS OF VARIANCE OF THE SAMPLE DATA (ON A SMALL-UNIT BASIS) 

                                       df          ms 
Between large units                    n - 1       $s_b^2$ 
Between small units within large 
  units                                n(M - 1)    $s_w^2$ 
Between small units in sample          nM - 1      $[(n - 1)s_b^2 + n(M - 1)s_w^2]/(nM - 1)$ 


The estimated variance of a large unit (on a small-unit basis) is $s_b^2$. 
It might be thought that an appropriate estimate of the variance of a small 
unit would be the mean square between all small units in the sample, that is, 

$$ s^2 = \frac{(n - 1)s_b^2 + n(M - 1)s_w^2}{nM - 1} \tag{9.3} $$
This estimate, although often satisfactory, is slightly biased because the 
sample is not a simple random sample of small units, since these are sampled 
in contiguous groups of $M$ units. 

An unbiased estimate is obtained from the sample by constructing an 
analysis of variance, as in Table 9.5, for the whole population, which con- 
tains N large units and NM small units. 


TABLE 9.5 
ANALYSIS OF VARIANCE FOR THE WHOLE POPULATION (ON A SMALL-UNIT BASIS) 

                                       df          ms 
Between large units                    N - 1       $S_b^2$ 
Between small units within 
  large units                          N(M - 1)    $S_w^2$ 
Between small units in the 
  population                           NM - 1      $[(N - 1)S_b^2 + N(M - 1)S_w^2]/(NM - 1)$ 


By its definition, the population variance among small units is given by 
the last line of the table, that is, 

$$ S^2 = \frac{(N - 1)S_b^2 + N(M - 1)S_w^2}{NM - 1} $$

With simple random sampling, $s_b^2$ in Table 9.4 is an unbiased estimate of 
$S_b^2$ (this follows from section 2.3). It may be shown easily that $s_w^2$ is 
an unbiased estimate of $S_w^2$. Hence an unbiased estimate of the variance 
$S^2$ among all small units in the population is 

$$ \hat{S}^2 = \frac{(N - 1)s_b^2 + N(M - 1)s_w^2}{NM - 1} \tag{9.4} $$

Clearly, this expression is almost the same as the simpler expression 

$$ \hat{S}^2 = \frac{s_b^2 + (M - 1)s_w^2}{M} \tag{9.5} $$

If $n > 50$, (9.3) for $s^2$ also reduces to (9.5), so that $s^2$ is a 
satisfactory approximation to $\hat{S}^2$ for $n > 50$. 


The estimates $s_b^2$ (for the large unit) and $\hat{S}^2$ (for the small 
unit) are on a common basis and may be substituted in theorem 9.1, 
corollary 2. If the sample is large, the small units may be measured for a 
random subsample of the large units (say 100 out of 600). Alternatively, two 
small units, chosen at random from each large unit, might be measured. More 
than one size of small unit may be investigated simultaneously, provided 
that we take data that give an unbiased estimate of $S_w^2$ for each small unit. 

With stratified sampling, the variances for the large and small units 
can be estimated by these methods separately in each stratum and then 
substituted in the appropriate formula for the variance of the estimate from 
a stratified sample. 


Example. The data come from a farm sample taken in North Carolina in 
1942 in order to estimate farm employment (Finkner, Morgan, and Monroe, 
1943). The method of drawing the sample was to locate points at random on the 
map and to choose as sampling units the three farms that were nearest to each 
point. This method is not recommended because a large farm has a greater 
chance of inclusion in the sample than a small farm, and an isolated farm has a 
greater chance than another in a densely farmed area. Any effects of this bias 
will be ignored. 


TABLE 9.6 
SAMPLE ANALYSIS OF VARIANCE (NUMBER OF PAID WORKERS) 
(SINGLE-FARM BASIS) 

                                   df       ms 
Between units within strata        825     6.218 
Between farms within units        2768     2.918 
Between farms within strata       3593     3.676 


From the sample data for individual farms, the group of three farms can be 
compared with the individual farm as a sampling unit. The item chosen is the 
number of paid workers. The sample was stratified, the stratum being a group 
of townships similar in density of farm population and in ratio of cropland to 
farmland. Since the sampling fraction was 1.9%, the fpc can be ignored. 

The variance of the estimated population total is 

$$ V(\hat{Y}) = \sum_h \frac{N_h^2 S_h^2}{n_h} $$

The correct procedure is to compute $N_h^2 S_h^2/n_h$ separately within each 
stratum for the two types of unit, using an analysis of variance and 
expression (9.5). We shall use a simpler procedure as an approximation. 

The strata contained in general between 300 and 450 farms, and either two or 
three 3-farm units were taken in each stratum to make the sampling 
approximately proportional. Assuming proportionality, that is, 
$n_h/N_h = n/N$, we may write 

$$ V(\hat{Y}) = \frac{N}{n}\sum_h N_h S_h^2 \doteq \frac{N^2}{n}\bar{S}^2 $$

if we assume further that the $S_h^2$ do not vary greatly among strata, so that 
they may be replaced by their average, $\bar{S}^2$. 

Estimates of $\bar{S}^2$ are obtained from the analysis of variance in 
Table 9.6, which is on a single-farm basis. 

For the group of three farms, the mean square $s_b^2 = 6.218$ serves as the 
estimate of $\bar{S}^2$ on a single-farm basis. For the individual farm, using 
(9.5), we have 

$$ \hat{S}^2 = \frac{6.218 + (2)(2.918)}{3} = 4.018 $$

By theorem 9.1, corollary 2, the two figures, 6.218 for the group of three farms 
and 4.018 for the individual farm, indicate the relative variances obtained for 
a fixed total size of sample. The group of farms gives about two thirds the 
precision of the single farm. Consideration of costs would presumably make the 
result more favorable to the three-farm unit. 
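The estimates (9.3), (9.4), and (9.5) are easy to compute from the two mean squares. The sketch below applies them to the mean squares of Table 9.6; the function names are hypothetical, the pooled mean square uses the degrees of freedom printed in the table, and the value of $N$ passed to (9.4) is an arbitrary large number chosen only to show that (9.4) and (9.5) nearly agree.

```python
def pooled_mean_square(df_b, s_b2, df_w, s_w2):
    """Mean square between all small units in the sample, as in (9.3)."""
    return (df_b * s_b2 + df_w * s_w2) / (df_b + df_w)

def small_unit_variance_exact(N, M, s_b2, s_w2):
    """Unbiased estimate (9.4) of the variance among small units."""
    return ((N - 1) * s_b2 + N * (M - 1) * s_w2) / (N * M - 1)

def small_unit_variance_simple(M, s_b2, s_w2):
    """Simpler expression (9.5)."""
    return (s_b2 + (M - 1) * s_w2) / M

s_b2, s_w2, M = 6.218, 2.918, 3          # Table 9.6, three farms per unit
total_ms = pooled_mean_square(825, s_b2, 2768, s_w2)
single_farm = small_unit_variance_simple(M, s_b2, s_w2)
single_farm_exact = small_unit_variance_exact(10_000, M, s_b2, s_w2)  # N assumed
relative_precision = single_farm / s_b2  # precision of 3-farm unit vs single farm

print(round(total_ms, 3), round(single_farm, 3), round(relative_precision, 2))
```

The pooled mean square reproduces the last line of Table 9.6, and the ratio of the two variances is the "about two thirds" quoted above.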


9.4 VARIANCE IN TERMS OF INTRACLUSTER 
CORRELATION 


Variance formulas are sometimes expressed in terms of the correlation 
coefficient $\rho$ between elements in the same cluster. This approach has 
already been used for systematic sampling (section 8.3). 

Let $y_{ij}$ be the observed value for the $j$th element within the $i$th unit, 
and let $y_i$ be the unit total. In cluster sampling we need to distinguish 
between two kinds of average: the mean per unit $\bar{Y} = \sum y_i/N$ and the 
mean per element $\bar{\bar{Y}} = \sum y_i/NM = \bar{Y}/M$. The variance among 
elements is 

$$ S^2 = \frac{\sum_i \sum_j (y_{ij} - \bar{\bar{Y}})^2}{NM - 1} $$

The intracluster correlation coefficient $\rho$ was defined (section 8.3) as 

$$ \rho = \frac{E(y_{ij} - \bar{\bar{Y}})(y_{ik} - \bar{\bar{Y}})}{E(y_{ij} - \bar{\bar{Y}})^2}
        = \frac{2\sum_i \sum_{j<k} (y_{ij} - \bar{\bar{Y}})(y_{ik} - \bar{\bar{Y}})}{(M - 1)(NM - 1)S^2} \tag{9.6} $$

The number of terms (cross products) in the numerator $E$ is $NM(M - 1)/2$, 
and the denominator $E$ is $(NM - 1)S^2/NM$. 

Theorem 9.2. A simple random sample of $n$ clusters, each containing 
$M$ elements, is drawn from the $N$ clusters in the population. Then the 
sample mean per element $\bar{\bar{y}}$ is an unbiased estimate of 
$\bar{\bar{Y}}$ with variance 

$$ V(\bar{\bar{y}}) = \frac{1 - f}{n}\,\frac{NM - 1}{M^2(N - 1)}\,S^2[1 + (M - 1)\rho]
   \doteq \frac{1 - f}{nM}\,S^2[1 + (M - 1)\rho] \tag{9.7} $$

where $\rho$ is the intracluster correlation coefficient. 
Proof. Let $y_i$ denote the total for the $i$th cluster and 
$\bar{y} = \sum^{n} y_i/n$. By theorems 2.1 and 2.2, $\bar{y}$ is an unbiased 
estimate of $\bar{Y}$ with variance 

$$ V(\bar{y}) = \frac{1 - f}{n}\,\frac{\sum_i (y_i - \bar{Y})^2}{N - 1} $$

But $\bar{y} = M\bar{\bar{y}}$ and $\bar{Y} = M\bar{\bar{Y}}$. Hence 
$\bar{\bar{y}}$ is an unbiased estimate of $\bar{\bar{Y}}$ with variance 

$$ V(\bar{\bar{y}}) = \frac{1 - f}{nM^2}\,\frac{\sum_i (y_i - \bar{Y})^2}{N - 1} \tag{9.8} $$

But 

$$ (y_i - \bar{Y}) = (y_{i1} - \bar{\bar{Y}}) + (y_{i2} - \bar{\bar{Y}}) + \cdots + (y_{iM} - \bar{\bar{Y}}) $$

Square and sum over all $N$ clusters. 

$$ \sum_i (y_i - \bar{Y})^2 = \sum_i \sum_j (y_{ij} - \bar{\bar{Y}})^2
   + 2\sum_i \sum_{j<k} (y_{ij} - \bar{\bar{Y}})(y_{ik} - \bar{\bar{Y}}) $$

$$ = (NM - 1)S^2 + (M - 1)(NM - 1)\rho S^2 $$

$$ = (NM - 1)S^2[1 + (M - 1)\rho] \tag{9.8a} $$

using the definition of $\rho$ in (9.6). Substitute in (9.8) for 
$V(\bar{\bar{y}})$. This gives 

$$ V(\bar{\bar{y}}) = \frac{1 - f}{n}\,\frac{NM - 1}{M^2(N - 1)}\,S^2[1 + (M - 1)\rho] $$

This completes the proof. 


If a simple random sample of $nM$ elements is taken, the formula for 
$V(\bar{\bar{y}})$ is the same as (9.7) except for the term in brackets. The 
factor 

$$ 1 + (M - 1)\rho $$

shows by how much the variance is changed by the use of a cluster instead 
of an element as sampling unit. If $\rho > 0$, the cluster is less precise for a 
given bulk of sample. If $\rho < 0$, as sometimes happens, the cluster is more 
precise. This result is a simple extension of theorem 8.2. 
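Theorem 9.2 can be checked numerically on a small artificial population. The sketch below is an illustration, not part of the original argument: the four clusters are made-up data, the function names are hypothetical, and the check simply enumerates every possible sample of $n$ clusters and compares the variance of the sample mean per element with the exact form of (9.7).

```python
from itertools import combinations

def intracluster_rho(clusters):
    """Return (S^2, rho) with S^2 as defined in the text and rho from (9.6)."""
    N, M = len(clusters), len(clusters[0])
    elements = [y for cluster in clusters for y in cluster]
    mean = sum(elements) / (N * M)
    S2 = sum((y - mean) ** 2 for y in elements) / (N * M - 1)
    cross = sum((c[j] - mean) * (c[k] - mean)
                for c in clusters for j in range(M) for k in range(j + 1, M))
    return S2, 2 * cross / ((M - 1) * (N * M - 1) * S2)

def variance_by_formula(clusters, n):
    """Exact form of (9.7)."""
    N, M = len(clusters), len(clusters[0])
    S2, rho = intracluster_rho(clusters)
    f = n / N
    return (1 - f) / n * (N * M - 1) / (M ** 2 * (N - 1)) * S2 * (1 + (M - 1) * rho)

def variance_by_enumeration(clusters, n):
    """Variance of the mean per element over all samples of n clusters."""
    N, M = len(clusters), len(clusters[0])
    pop_mean = sum(sum(c) for c in clusters) / (N * M)
    means = [sum(sum(c) for c in sample) / (n * M)
             for sample in combinations(clusters, n)]
    return sum((m - pop_mean) ** 2 for m in means) / len(means)

clusters = [[1, 2, 3], [2, 3, 5], [7, 8, 9], [4, 6, 6]]   # made-up population
S2, rho = intracluster_rho(clusters)
design_factor = 1 + (len(clusters[0]) - 1) * rho
print(rho, design_factor)
print(variance_by_formula(clusters, 2), variance_by_enumeration(clusters, 2))
```

In this artificial population the elements within a cluster are alike, so $\rho$ is positive and the factor $1 + (M - 1)\rho$ exceeds 1.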

An alternative expression can be given for $\rho$. Let $S_b^2$ denote the 
variance among cluster totals, on a single unit basis. Then 

$$ \sum_i (y_i - \bar{Y})^2 = (N - 1)MS_b^2 $$

Equation 9.8a can be rewritten as 

$$ (N - 1)MS_b^2 = (NM - 1)S^2[1 + (M - 1)\rho] $$

so that 

$$ \rho = \frac{(N - 1)MS_b^2 - (NM - 1)S^2}{(NM - 1)(M - 1)S^2}
        \doteq \frac{S_b^2 - S^2}{(M - 1)S^2} $$

A good discussion of the numerical values of $\rho$ for different items and 
different sizes of cluster is given by Hansen, Hurwitz, and Madow (1953), 
who regard $\rho$ as a "measure of homogeneity" of the cluster. 


9.5 VARIANCE FUNCTIONS 


In some types of surveys, for example, soil sampling, crop cutting, and 
surveys of farming that utilize an areal sampling unit, the size of the 
unit may be capable of almost continuous variation. In the search for the 
best unit the problem is not that of choosing between two or three specific 
sizes that have been tried but of finding the optimum value of $M$ regarded 
as a continuous variable. This problem requires a method of predicting 
the variance $S_b^2$ between units in the population as a function of $M$. 
From the analysis of variance, $S_b^2$ can be found if we know (a) the 
variance $S^2$ between all elements in the population and (b) the variance 
$S_w^2$ between elements that lie in the same unit. Our approach is to predict 
$S^2$ and $S_w^2$ and to find $S_b^2$ by the analysis of variance. 

The sample data produce estimates of $S^2$ and $S_w^2$ for the size of unit 
actually used. Since $S^2$ is the variance among elements, it is not affected 
by the size of the unit. However, $S_w^2$ will be affected. It might be 
expected to increase as the size of the large unit increases. If the large 
units to be examined differ little in size from the unit actually used, a 
first approximation is to regard $S_w^2$ as constant, using the estimate given 
by the sample data. An investigation by McVay (1947) suggests that this 
approximation may often be satisfactory. 

As a better approximation, attempts have been made (Jessen, 1942; 
Mahalanobis, 1944; Hendricks, 1944) to develop a general law to predict 
how $S_w^2$ changes with the size of unit. In several agricultural surveys, 
$S_w^2$ appeared to be related to $M$ by the empirical formula 

$$ S_w^2 = AM^g \qquad (g > 0) \tag{9.9} $$

where $A$ and $g$ are constants that do not depend on $M$. In this formula 
$S_w^2$ increases steadily as $M$ increases. Usually $g$ is small. A curve of 
this type is plausible if there are forces that exert a similar influence 
on elements close together. Climate, soil type, topography, and access to 
markets tend to give neighboring farms similar features. 

Formula 9.9 is open to objection, since it makes $S_w^2$ increase without 
bound as $M$ increases. If there is no correlation between elements that are 
far apart, a formula in which $S_w^2$ approaches an upper bound with large $M$ 
would be more appropriate. However, any formula will suffice if it gives a 
good fit over the range of $M$ that is under investigation. 
If this formula fits, $\log S_w^2$ should plot as a straight line against 
$\log M$. Values of $S_w^2$ for at least two values of $M$ are needed in order 
to estimate the constants $\log A$ and $g$. At least three values of $M$ are 
necessary for any appraisal of the linearity of the fit. 

From the analysis of variance in Table 9.5 we find 

$$ S_b^2 = \frac{(NM - 1)S^2 - N(M - 1)S_w^2}{N - 1} $$

$$ = \frac{(NM - 1)S^2 - N(M - 1)AM^g}{N - 1} \tag{9.10} $$

$$ \doteq MS^2 - (M - 1)AM^g \tag{9.11} $$

Hendricks (1944) has pointed out that the complete population might be 
regarded as a single large sampling unit containing $NM$ elements. If (9.9) 
holds, then $S^2 = A(NM)^g$. The advantage of this device is that the values 
of $A$ and $g$ can now be estimated from the data for a survey in which only 
one value of $M$ was used. The two equations that lead to the estimates are 

$$ \log s_w^2 = \log A + g \log M $$

$$ \log \hat{S}^2 = \log A + g \log (NM) $$

The formula for $S_b^2$ becomes [from (9.10)] 

$$ S_b^2 = \frac{AM^g[(NM - 1)N^g - N(M - 1)]}{N - 1} $$

This method furnishes no check on the correctness of (9.9). It might 
happen that the formula held well enough for small values of $M$ but failed 
for a value as large as $NM$. In this event the more general formulas (9.10) 
and (9.11) should be employed. 
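Hendricks' device amounts to solving two linear equations in $\log A$ and $g$. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the input values (a within-unit mean square of 2.918, an element variance of 4.018, $M = 3$, and in particular $N = 600$) are chosen only to exercise the formulas, not taken from any survey analysis in the text.

```python
import math

def fit_variance_law(s_w2, S2_hat, N, M):
    """Solve log s_w^2 = log A + g log M and log S^2 = log A + g log(NM)."""
    g = (math.log(S2_hat) - math.log(s_w2)) / math.log(N)
    A = s_w2 / M ** g
    return A, g

def between_unit_variance(A, g, S2, total_elements, M_new):
    """S_b^2 from (9.10) for units of size M_new, taking S_w^2 = A * M_new**g."""
    N_new = total_elements / M_new
    return ((total_elements - 1) * S2
            - N_new * (M_new - 1) * A * M_new ** g) / (N_new - 1)

s_w2, S2_hat, N, M = 2.918, 4.018, 600, 3      # illustrative values only
A, g = fit_variance_law(s_w2, S2_hat, N, M)

# Predicted S_b^2 (small-unit basis) for several candidate unit sizes
for M_new in (2, 3, 4, 6):
    print(M_new, round(between_unit_variance(A, g, S2_hat, N * M, M_new), 3))
```

Because the fit passes exactly through the two observed points, it gives no evidence on whether (9.9) holds at intermediate sizes, which is the caution stated above.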

Formula 9.9 is presented as an example of the methodology rather than 
as a general law. The reader who faces a similar problem should construct 
and test whatever type of formula seems most appropriate to his material. 
In some cases $\log S_w^2$ might be a simple function of $M$. 


9.6 A COST FUNCTION 


In an extensive survey the nature of the field costs plays a large part in 
determining the optimum unit. Asan illustration of the role of cost factors, 
We shall describe a cost function developed by Jessen (1942) for farm sur- 
veys in which the large units are clusters of neighboring farms. 

Two components of field cost are distinguished. The component c, Mn 
Comprises costs that vary directly with the total number ofelements (farms). 
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Thus c, contains the cost of the interview and the cost of travel from 
farm to farm within the cluster. 

The second component, c/n, measures the cost of travel between the 
clusters. Tests on a map showed that this cost, for a fixed population, 
varies approximately as the square root of the number of clusters. Total 
field cost is therefore 

C=c¢,Mn + c, Vn (9.12) 
Assuming simple random sampling and ignoring the fpc, the variance of 
the mean per element j is S,?/nM. From (9.11), this equals 
S? — (M — D)AM?" 
n 

To determine the optimum size of unit, we find M, and incidentally 7, 
to minimize V for fixed C. The general solution is complicated, although 
its application in a numerical problem presents no great difficulty. 

By some manipulation we can obtain the equation that gives the 


optimum M. First solve the cost equation (9.12) as a quadratic in vn. 
This gives 


VG) = (9.13) 


m % 
2qMv/n _ (1 + sea) E (9.14) 
Cy C2 
The equation to be minimized is 
C+ IV = e Mn + egV/n + AV 
Differentiating, and noting that aV/0n = — V[n, we obtain the equations 
n: aM + 3con^*í = — ov ANIA (9.15) 
ðn n 
M: eL (9.16) 
ðM 
Divide (9.16) by (9.15) to eliminate 2. This leads to 
50) 4E TN cn 
" V 0M eM + ben” 
AE OP i UEAN (9.17) 
V ðM 1 + ey[26 M4 n 


If we substitute for Vn from (9.14), we obtain, after some simplification, 
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By writing out the left side of this equation in full and changing signs on 
both sides, we find 


AM"?[gM —(g —1)) _ , _ (1 + 369)" 

S? — (M — 1)AM** Ca 
This equation gives the optimum M. The left side does not involve any
of the cost factors, being dependent only on the shape of the variance
function. Both sides can be seen to be increasing functions of M, for
g > 0, M > 1, within the region of interest. Suppose that the solution
has been found for specified values of C, $c_1$, and $c_2$, and we wish to examine
the effect of an increase in $c_1$ on this solution. The left side does not
depend on $c_1$ but the right side increases as $c_1$ increases. Consequently
the optimum value of M will decrease. A decrease in $c_2$ produces a similar
effect.
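
The optimum-M equation has no closed form, but a numerical search is straightforward. The following Python sketch is only an illustration: the values of S^2, A, g, C, c_1, and c_2 are assumptions chosen for the example, not figures from the text, and the search range 1 <= M <= 8 is an assumed "region of interest". It minimizes V over a grid of M, with n fixed by the cost equation (9.12), and then checks that the minimizing M nearly satisfies the equation above.

```python
import math

# Illustrative values only (assumptions, not taken from the text).
S2, A, g = 1.0, 0.7, 0.1          # variance function: S_w^2 = A * M**g
C, c1, c2 = 5000.0, 1.0, 100.0    # cost function: C = c1*M*n + c2*sqrt(n)

def n_for_cost(M):
    """Number of clusters n that exhausts the budget C, from (9.13)."""
    root_n = (c2 / (2 * c1 * M)) * (math.sqrt(1 + 4 * C * c1 * M / c2**2) - 1)
    return root_n ** 2

def variance(M):
    """V = [S^2 - (M - 1) A M^(g-1)] / n, with n set by the budget."""
    return (S2 - (M - 1) * A * M ** (g - 1)) / n_for_cost(M)

def lhs(M):
    """Left side of the optimum-M equation (variance function only)."""
    return A * M ** (g - 1) * (g * M - (g - 1)) / (S2 - (M - 1) * A * M ** (g - 1))

def rhs(M):
    """Right side of the optimum-M equation (cost factors only)."""
    return 1 - (1 + 4 * C * c1 * M / c2**2) ** -0.5

# Grid search over the assumed region of interest, 1 <= M <= 8.
grid = [1 + 0.001 * k for k in range(7001)]
M_opt = min(grid, key=variance)
n_opt = n_for_cost(M_opt)
print(round(M_opt, 2), round(n_opt, 1), lhs(M_opt) - rhs(M_opt))
```

In practice M must be an integer, so one would compare V at the integers on either side of the value found.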

Now $c_1$ increases if the length of interview increases, whereas $c_2$
decreases if travel becomes cheaper or if the farms in a given area become
denser. These facts lead to the conclusion that the optimum size of unit
becomes smaller when


length of interview increases 

travel becomes cheaper 

the elements (farms) become more dense 
total amount of money used (C) increases 


This conclusion is a consequence of the type of cost function and would 
require re-examination with a different function. It illustrates the fact 
that the optimum unit is not a fixed characteristic of the population, but 
depends also on the type of survey and on the levels of prices and wages. 

Hansen, Hurwitz, and Madow (1953) give an excellent discussion of 
the construction of cost functions for surveys involving cluster sampling. 


9.7 CLUSTER SAMPLING FOR PROPORTIONS 


The same techniques apply to cluster sampling for proportions. Suppose
that the M elements in any cluster can be classified into two classes and
that $p_i = a_i/M$ is the proportion in class C in the ith cluster. A simple
random sample of n clusters is taken, and the average p of the observed
$p_i$ in the sample is used as the estimate of the population proportion P.

It will be recalled (section 3.12) that we cannot use binomial theory to
find V(p) but must apply the formula for continuous variates to the $p_i$.
This gives

\[ V(p) = \frac{N - n}{N}\,\frac{\sum_{i=1}^{N}(p_i - P)^2}{n(N - 1)} \approx \frac{N - n}{N^2 n}\sum_{i=1}^{N}(p_i - P)^2 \]

Alternatively, if we take a simple random sample of nM elements, the
variance of p is obtained by binomial theory (theorem 3.2) as

\[ V_{bin}(p) = \frac{(NM - nM)}{(NM - 1)}\,\frac{PQ}{nM} \approx \frac{N - n}{N}\,\frac{PQ}{nM} \]

if N is large. Consequently the factor

\[ \frac{V(p)}{V_{bin}(p)} \approx \frac{M\sum(p_i - P)^2}{NPQ} \qquad (N \text{ large}) \tag{9.18} \]

shows the relative change in the variance due to the use of clusters.
Numerical values of this factor are helpful in making preliminary estimates
of sample size with cluster sampling. The required sample size is first
estimated by the binomial formula and then multiplied by the factor to
indicate the size that will be necessary with cluster sampling. For an
illustration, see Cornfield (1951).

If the cluster sizes $M_i$ are variable, the estimate $p = \sum a_i/\sum M_i$ is a ratio
estimate. Its variance is given approximately by the formula (section 3.12)

\[ V(p) \approx \frac{N - n}{Nn\bar{M}^2}\,\frac{\sum_{i=1}^{N} M_i^2(p_i - P)^2}{N - 1} \]

where $\bar{M} = \sum M_i/N$ is the average size of cluster.


If this sample is compared with a simple random sample of $n\bar{M}$ elements,
we find, as a generalization of (9.18),

\[ \frac{V(p)}{V_{bin}(p)} \approx \frac{\sum M_i^2(p_i - P)^2}{N\bar{M}PQ} \tag{9.19} \]

As with continuous variates, the relationship of size of cluster to between-
cluster variance can be investigated, either by expressing the factor in
(9.18) and (9.19) as a function of M, or by seeking a relation between the
within-cluster variance and M. If we assign the value 1 to any unit that
falls in class C and 0 to any other unit, the fundamental analysis of variance
equation for fixed M is

\[ NMP(1 - P) = M\sum(p_i - P)^2 + M\sum p_i(1 - p_i) \]

total ss = ss between clusters + ss within clusters

From this relation the mean square within clusters can be computed and
plotted as a function of M. McVay (1947) describes how this analysis can
be used to investigate optimum cluster size.
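
As an illustration of the factor (9.18) and of the analysis of variance identity for equal-sized clusters, here is a minimal Python sketch. The cluster counts `a`, the cluster size `M`, and the binomial sample size `binomial_n` are hypothetical values made up for the example; the function names are likewise not from the text.

```python
def cluster_factor(a, M):
    """Factor (9.18): M * sum((p_i - P)^2) / (N * P * Q), equal cluster size M."""
    N = len(a)
    p = [ai / M for ai in a]
    P = sum(p) / N
    between = sum((pi - P) ** 2 for pi in p)
    return M * between / (N * P * (1 - P))

def anova_terms(a, M):
    """Total, between-cluster and within-cluster sums of squares for a 0/1 item."""
    N = len(a)
    p = [ai / M for ai in a]
    P = sum(p) / N
    total = N * M * P * (1 - P)
    between = M * sum((pi - P) ** 2 for pi in p)
    within = M * sum(pi * (1 - pi) for pi in p)
    return total, between, within

# Hypothetical population: N = 4 clusters of M = 10 elements each;
# a_i is the number of elements of cluster i that fall in class C.
a, M = [2, 4, 6, 8], 10
factor = cluster_factor(a, M)
total, between, within = anova_terms(a, M)
mean_square_within = within / (len(a) * (M - 1))
binomial_n = 400                     # assumed size from the binomial formula
print(factor, binomial_n * factor, total, between, within, mean_square_within)
```

Repeating the calculation for several values of M would give the points needed to plot the within-cluster mean square against M.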
9.8 CLUSTER UNITS OF UNEQUAL SIZES


When the cluster units contain different numbers of elements, there
are several methods of estimating population totals and means. These
are discussed in the remainder of this chapter. Let $M_i$ be the number of
elements in the ith unit. As a practical point, note that in some surveys
the values of all the $M_i$ in the population are known exactly, or almost so,
in advance, for instance when the elements are the employees in a firm
with up-to-date records and the cluster units are the firm's branches. In
others the $M_i$ are not known, except that the $M_i$ for those units that fall
in the sample become known during the field work. For any proposed
estimate, the sampler must satisfy himself that he possesses the knowledge
of the $M_i$ that the estimate demands.

Consider first the estimation of the population total Y of the $y_{ij}$, from
a simple random sample of n cluster units.


Unbiased Estimate 


As before, let

\[ y_i = \sum_{j=1}^{M_i} y_{ij} \]

denote the item total for the ith unit. By the corollary in theorem 2.1, an
unbiased estimate of Y is

\[ \hat{Y}_u = \frac{N}{n}\sum_{i=1}^{n} y_i \tag{9.20} \]

By theorem 2.2, corollary 2, its variance is

\[ V(\hat{Y}_u) = \frac{N^2(1 - f)}{n}\,\frac{\sum_{i=1}^{N}(y_i - \bar{Y})^2}{N - 1} \tag{9.21} \]

where $\bar{Y} = Y/N$ is the population mean per unit.

The estimate $\hat{Y}_u$ is often found to be of poor precision. This occurs
when the $\bar{y}_i$ (means per element) vary little from unit to unit and the $M_i$
vary greatly. In this event the $y_i = M_i\bar{y}_i$ also vary greatly from unit to
unit and the variance in (9.21) is large.


Ratio to Size Estimate 


Let

\[ M_0 = \sum_{i=1}^{N} M_i = \text{total number of elements in the population} \]

If $M_0$ is known, an alternative is a ratio estimate in which $M_i$ is taken
as the auxiliary variate $x_i$.

\[ \hat{Y}_R = M_0\,\frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i} = M_0\,(\text{sample mean per element}) \]

In the notation of the ratio estimate the population ratio $R = Y/X = Y/M_0 = \bar{\bar{Y}}$,
the population mean per element. By theorem 6.1, assuming
that the number of clusters in the sample is large,

\[ V(\hat{Y}_R) \approx \frac{N^2(1 - f)}{n}\,\frac{\sum_{i=1}^{N}(y_i - M_i\bar{\bar{Y}})^2}{N - 1} \tag{9.22} \]

\[ = \frac{N^2(1 - f)}{n}\,\frac{\sum_{i=1}^{N} M_i^2(\bar{y}_i - \bar{\bar{Y}})^2}{N - 1} \tag{9.23} \]

As (9.23) shows, the variance of $\hat{Y}_R$ depends on the variability among
the means per element and is often found to be much smaller than $V(\hat{Y}_u)$.
The corresponding estimates of the population mean per element are

\[ \hat{\bar{\bar{Y}}}_u = \frac{\hat{Y}_u}{M_0} = \frac{N}{nM_0}\sum_{i=1}^{n} y_i, \qquad \hat{\bar{\bar{Y}}}_R = \frac{\sum y_i}{\sum M_i} = \text{sample mean per element} \]

Note that the unbiased estimate $\hat{\bar{\bar{Y}}}_u$ requires a knowledge of $M_0$, whereas
the ratio estimate requires a knowledge of only the $M_i$ that fall into the
sample.
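
To see how different the two variances can be, the following Python sketch evaluates (9.21) and (9.22) for a small, entirely hypothetical population in which the mean per element is the same in every unit while the sizes vary. The function names and the data are assumptions made for the illustration.

```python
def unbiased_total(y_sample, N):
    """(9.20): N times the sample mean of the unit totals."""
    return N * sum(y_sample) / len(y_sample)

def ratio_total(y_sample, M_sample, M0):
    """Ratio-to-size estimate: M0 times the sample mean per element."""
    return M0 * sum(y_sample) / sum(M_sample)

def var_unbiased(y, n):
    """(9.21), computed from a fully known population of unit totals y."""
    N = len(y)
    Ybar = sum(y) / N
    return N**2 * (1 - n / N) / n * sum((yi - Ybar) ** 2 for yi in y) / (N - 1)

def var_ratio(y, M, n):
    """(9.22), the large-sample approximation for the ratio estimate."""
    N = len(y)
    R = sum(y) / sum(M)               # population mean per element
    ss = sum((yi - Mi * R) ** 2 for yi, Mi in zip(y, M))
    return N**2 * (1 - n / N) / n * ss / (N - 1)

# Hypothetical population: sizes differ, mean per element is 5 in every unit.
M = [2, 4, 6]
y = [10, 20, 30]
v_u = var_unbiased(y, n=2)
v_r = var_ratio(y, M, n=2)
print(v_u, v_r)
print(unbiased_total([10, 30], N=3), ratio_total([10, 30], [2, 6], M0=12))
```

Here the ratio variance vanishes because every $y_i$ is exactly proportional to $M_i$, whereas the unbiased estimate still carries the full variation among the unit totals.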


Mean of the Unit Means


A third possibility is to use the unweighted mean of the unit means,
that is,

\[ \bar{\bar{y}} = \frac{1}{n}(\bar{y}_1 + \bar{y}_2 + \cdots + \bar{y}_n) \tag{9.24} \]

When the $M_i$ vary, this estimate is not only biased but inconsistent.
The bias may, however, be unimportant if $\bar{y}_i$ is uncorrelated with $M_i$ and
the estimate is occasionally useful. Its properties have been investigated
by Sukhatme (1954).
9.9 SAMPLING WITH PROBABILITY PROPORTIONAL
TO SIZE


If all the $M_i$ are known, another technique, suggested by Hansen and
Hurwitz (1943), is to select the units with probabilities proportional to
their sizes $M_i$. This technique has found its principal use in surveys which
employ subsampling (Chapter 11), but it is also applicable to the present
problem. Sampling with probability proportional to size is illustrated in
the following example of a small population of seven units:


Unit    Size $M_i$    $\sum M_i$    Assigned range

1       3             3             1-3
2       1             4             4
3       11            15            5-15
4       6             21            16-21
5       4             25            22-25
6       2             27            26-27
7       3             30            28-30


The cumulative sum of the $M_i$ is formed. To select a unit, we draw a
random number between 1 and 30: suppose that this is 19. In the sum,
number 19 falls in unit 4, which covers numbers 16 to 21 inclusive. With
this method of drawing, the probability that any unit is selected is
proportional to the size of the unit.
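
The cumulative-sum procedure can be sketched in a few lines of Python. This is only an illustration of the method described above; the function name `pps_draw` and the fixed random seed are assumptions made for the example.

```python
import bisect
import itertools
import random

def pps_draw(sizes, r):
    """Return the 1-based unit whose assigned range contains the number r."""
    cumulative = list(itertools.accumulate(sizes))
    if not 1 <= r <= cumulative[-1]:
        raise ValueError("r must lie between 1 and the total size")
    # First position whose cumulative sum is at least r.
    return bisect.bisect_left(cumulative, r) + 1

sizes = [3, 1, 11, 6, 4, 2, 3]         # the M_i of the seven units above
unit_for_19 = pps_draw(sizes, 19)      # the text's example: number 19
print(unit_for_19)

# Each draw uses a fresh random number between 1 and the total size.
rng = random.Random(1)                 # fixed seed, only for repeatability
sample = [pps_draw(sizes, rng.randint(1, sum(sizes))) for _ in range(2)]
print(sample)
```

Because every number from 1 to 30 is equally likely, a unit of size $M_i$ is hit by exactly $M_i$ of the 30 numbers, which is what makes the selection probability proportional to size.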

If a second unit is to be selected, the process is repeated with a new 
random number between 1 and 30. However, contrary to our previous 
practice, we do not forbid the selection of unit 4 a second time. Selection 
with replacement is necessary, when n exceeds 1, in order to keep the 
probabilities of selection proportional to the sizes. This may be seen by 
the extreme case n = 7. If selection were made without replacement, all 
units would automatically be chosen, even though we had gone through 
the procedure of selection with probability proportional to size. For 
values of n between 1 and 7, selection without replacement leads to 
probabilities that are intermediate between equal probabilities and
probabilities proportional to size.

The advantage of sampling with replacement is that the formulas for
the true and estimated variances of the estimates are simple. In general,
sampling with replacement is less precise than sampling without
replacement. When n/N is small, however, the chance that the same unit appears
twice in the sample is small, and sampling with replacement is almost
equivalent to sampling without replacement. For situations in which the
N's are small, as in stratified sampling, much research has been done in
recent years to develop practicable methods of sampling with unequal
probabilities and without replacement (see section 9.14).


9.10 THEORY FOR SELECTION WITH ARBITRARY 
PROBABILITIES 


If the ith unit is selected with probability $z_i = M_i/M_0$ and with
replacement, we shall show that an unbiased estimate of the population total Y is

\[ \hat{Y}_{pps} = \frac{M_0}{n}(\bar{y}_1 + \bar{y}_2 + \cdots + \bar{y}_n) = M_0\,(\text{mean of the unit means per element}) \tag{9.25} \]

where $M_0 = \sum M_i$ = total number of elements in the population. Further,

\[ V(\hat{Y}_{pps}) = \frac{M_0}{n}\sum_{i=1}^{N} M_i(\bar{y}_i - \bar{\bar{Y}})^2 \tag{9.26} \]

so that the variance of $\hat{Y}_{pps}$, like that of $\hat{Y}_R$, depends on the variability
of the unit means per element.


In some applications the sizes are known only approximately. In others
the "size" is not the number of elements in the unit but simply a measure
of its bigness that is thought to be highly correlated with the unit total $y_i$.
For instance, the "size" of a hospital might be measured by the total
number of beds or by the average number of occupied beds. Similarly,
various measures of the "size" of a restaurant, a bank, or a farm can be
devised. Consequently, we shall consider sampling with probability
proportional to an estimate or measure of size $M_i'$ (ppes sampling). If
$z_i = M_i'/M_0'$ where $M_0' = \sum M_i'$, it will be shown that

\[ \hat{Y}_{ppes} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{z_i} \tag{9.27} \]

is an unbiased estimate of Y with variance

\[ V(\hat{Y}_{ppes}) = \frac{1}{n}\sum_{i=1}^{N} z_i\left(\frac{y_i}{z_i} - Y\right)^2 \tag{9.28} \]

Results (9.27) and (9.28) are generalizations of the results (9.25) and (9.26).
The proofs utilize a method introduced in section 2.8. Let $t_i$ be the
number of times that the ith unit appears in a specific sample of size n,
where $t_i$ may have any of the values 0, 1, 2, ..., n. Consider the joint
frequency distribution of the $t_i$ for all N units in the population.
The method of drawing the sample is equivalent to the standard
probability problem in which n balls are thrown into N boxes, the probability
that a ball goes into the ith box being $z_i$ at every throw. Consequently
the joint distribution of the $t_i$ is the multinomial expression

\[ \frac{n!}{t_1!\,t_2!\cdots t_N!}\,z_1^{t_1}z_2^{t_2}\cdots z_N^{t_N} \]

For the multinomial, the following properties of the distribution of the
$t_i$ are well known:

\[ E(t_i) = nz_i, \qquad V(t_i) = nz_i(1 - z_i), \qquad \mathrm{Cov}(t_it_j) = -nz_iz_j \tag{9.29} \]


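Theorem 9.3. If a sample of n units is drawn with probabilities $z_i$ and
with replacement, then

\[ \hat{Y}_{ppes} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{z_i} \tag{9.27} \]

is an unbiased estimate of Y with variance

\[ V(\hat{Y}_{ppes}) = \frac{1}{n}\sum_{i=1}^{N} z_i\left(\frac{y_i}{z_i} - Y\right)^2 \tag{9.28} \]

Proof. We may write

\[ \hat{Y}_{ppes} = \frac{1}{n}\left(t_1\frac{y_1}{z_1} + t_2\frac{y_2}{z_2} + \cdots + t_N\frac{y_N}{z_N}\right) = \frac{1}{n}\sum_{i=1}^{N} t_i\frac{y_i}{z_i} \]

where the sum extends over all units in the population. In repeated
sampling the t's are the random variables, whereas the $y_i$ and the $z_i$ are
a set of fixed numbers. Hence, since $E(t_i) = nz_i$ by (9.29),

\[ E(\hat{Y}_{ppes}) = \frac{1}{n}\sum_{i=1}^{N}(nz_i)\frac{y_i}{z_i} = \sum_{i=1}^{N} y_i = Y \]

Further,

\[ V(\hat{Y}_{ppes}) = \frac{1}{n^2}\left[\sum_{i=1}^{N}\left(\frac{y_i}{z_i}\right)^2 V(t_i) + 2\sum_{i=1}^{N}\sum_{j>i}\frac{y_i}{z_i}\frac{y_j}{z_j}\mathrm{Cov}(t_it_j)\right] \]

\[ = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{y_i^2}{z_i}(1 - z_i) - 2\sum_{i=1}^{N}\sum_{j>i} y_iy_j\right] = \frac{1}{n}\left(\sum_{i=1}^{N}\frac{y_i^2}{z_i} - Y^2\right) = \frac{1}{n}\sum_{i=1}^{N} z_i\left(\frac{y_i}{z_i} - Y\right)^2 \]

since $\sum z_i = 1$.

Taking $z_i = M_i/M_0$ in theorem 9.3 gives the corresponding results for
sampling with probability proportional to size.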
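
Theorem 9.3 can be checked by brute force on a small example. The Python sketch below enumerates every ordered sample of n = 2 draws with replacement from a hypothetical population (the values of `y` and `z` are invented for the illustration) and compares the exact mean and variance of the estimate with Y and with (9.28). Exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical population: unit totals y_i and selection probabilities z_i.
y = [F(1), F(3), F(2), F(6)]
z = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]
n = 2
Y = sum(y)

def estimate(sample):
    """(9.27): mean of y_i / z_i over the n draws."""
    return sum(y[i] / z[i] for i in sample) / len(sample)

def prob(sample):
    """Probability of an ordered sample drawn with replacement."""
    p = F(1)
    for i in sample:
        p *= z[i]
    return p

samples = list(product(range(len(y)), repeat=n))
mean = sum(prob(s) * estimate(s) for s in samples)
var = sum(prob(s) * (estimate(s) - mean) ** 2 for s in samples)
formula = sum(zi * (yi / zi - Y) ** 2 for yi, zi in zip(y, z)) / n   # (9.28)
print(mean == Y, var == formula)
```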
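Theorem 9.4. If a sample of n units is drawn with probabilities $z_i = M_i/M_0$
and with replacement, then

\[ \hat{Y}_{pps} = \frac{M_0}{n}(\bar{y}_1 + \bar{y}_2 + \cdots + \bar{y}_n) \tag{9.25} \]

is an unbiased estimate of Y with variance

\[ V(\hat{Y}_{pps}) = \frac{M_0}{n}\sum_{i=1}^{N} M_i(\bar{y}_i - \bar{\bar{Y}})^2 \tag{9.26} \]

Proof. Putting $z_i = M_i/M_0$ in theorem 9.3, we obtain

\[ \hat{Y}_{pps} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{z_i} = \frac{M_0}{n}\sum_{i=1}^{n}\frac{y_i}{M_i} = \frac{M_0}{n}\sum_{i=1}^{n}\bar{y}_i \]

\[ V(\hat{Y}_{pps}) = \frac{1}{n}\sum_{i=1}^{N} z_i\left(\frac{y_i}{z_i} - Y\right)^2 = \frac{1}{n}\sum_{i=1}^{N}\frac{M_i}{M_0}\left(\frac{M_0y_i}{M_i} - Y\right)^2 = \frac{M_0}{n}\sum_{i=1}^{N} M_i(\bar{y}_i - \bar{\bar{Y}})^2 \]

since $\bar{\bar{Y}} = Y/M_0$.


Corollary. An unbiased estimate of $\bar{\bar{Y}}$ is

\[ \hat{\bar{\bar{Y}}}_{pps} = \frac{1}{n}(\bar{y}_1 + \bar{y}_2 + \cdots + \bar{y}_n) \]

with variance

\[ V(\hat{\bar{\bar{Y}}}_{pps}) = \frac{1}{nM_0}\sum_{i=1}^{N} M_i(\bar{y}_i - \bar{\bar{Y}})^2 \tag{9.30} \]

The next two theorems show how to estimate the variance from the
sample.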


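Theorem 9.5. Under the conditions of theorem 9.3, an unbiased
estimate of $V(\hat{Y}_{ppes})$ is

\[ v(\hat{Y}_{ppes}) = \frac{\sum_{i=1}^{n}\left(\dfrac{y_i}{z_i} - \hat{Y}_{ppes}\right)^2}{n(n - 1)} \tag{9.31} \]

Proof. By the usual algebraic identity we may write

\[ \sum_{i=1}^{n}\left(\frac{y_i}{z_i} - \hat{Y}_{ppes}\right)^2 = \sum_{i=1}^{n}\left(\frac{y_i}{z_i} - Y\right)^2 - n(\hat{Y}_{ppes} - Y)^2 \]

Hence

\[ E[n(n - 1)v(\hat{Y}_{ppes})] = E\left[\sum_{i=1}^{n}\left(\frac{y_i}{z_i} - Y\right)^2\right] - nV(\hat{Y}_{ppes}) \]

since, by the definition of $V(\hat{Y}_{ppes})$, the mean value of the second term on
the right is $-nV(\hat{Y}_{ppes})$. Introducing the variables $t_i$, we have

\[ E[n(n - 1)v(\hat{Y}_{ppes})] = E\left[\sum_{i=1}^{N} t_i\left(\frac{y_i}{z_i} - Y\right)^2\right] - nV(\hat{Y}_{ppes}) = n\sum_{i=1}^{N} z_i\left(\frac{y_i}{z_i} - Y\right)^2 - nV(\hat{Y}_{ppes}) \]

that is,

\[ n(n - 1)E[v(\hat{Y}_{ppes})] = n^2V(\hat{Y}_{ppes}) - nV(\hat{Y}_{ppes}) = n(n - 1)V(\hat{Y}_{ppes}) \]

using (9.28) in theorem 9.3. This completes the proof.

Theorem 9.6. If units are drawn with probability $z_i = M_i/M_0$ and
with replacement, then

\[ v(\hat{Y}_{pps}) = \frac{M_0^2}{n(n - 1)}\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}})^2 \tag{9.32} \]

is an unbiased estimate of $V(\hat{Y}_{pps})$, where $\bar{\bar{y}}$ is the unweighted mean of
the $\bar{y}_i$.

This result is obtained by substituting $z_i = M_i/M_0$ in (9.31). Since
$\hat{Y}_{pps} = M_0\bar{\bar{y}}$, the estimated variance (apart from the multiplier) is the
familiar sum of squares of deviations of the $\bar{y}_i$ from their mean.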
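
The unbiasedness of the variance estimate (9.31) can also be confirmed by enumeration. The following Python sketch, again on a hypothetical population with made-up `y` and `z`, averages $v(\hat{Y}_{ppes})$ over all ordered samples of n = 3 draws with replacement and compares the result with (9.28), using exact fractions.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical population (values invented for the illustration).
y = [F(1), F(3), F(2), F(6)]
z = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]
n = 3
Y = sum(y)

def v_hat(sample):
    """(9.31): squared deviations of y_i/z_i about their mean, over n(n-1)."""
    ratios = [y[i] / z[i] for i in sample]
    est = sum(ratios) / len(ratios)
    return sum((r - est) ** 2 for r in ratios) / (len(ratios) * (len(ratios) - 1))

def prob(sample):
    """Probability of an ordered sample drawn with replacement."""
    p = F(1)
    for i in sample:
        p *= z[i]
    return p

expected_v = sum(prob(s) * v_hat(s) for s in product(range(len(y)), repeat=n))
true_var = sum(zi * (yi / zi - Y) ** 2 for yi, zi in zip(y, z)) / n   # (9.28)
print(expected_v == true_var)
```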


9.11 THE OPTIMUM MEASURE OF SIZE 


In cases in which the measure of size $M_i'$ is some estimate of the bigness
of the unit, a question of theoretical interest is: what measure of size
minimizes the variance of $\hat{Y}_{ppes}$? Now,

\[ V(\hat{Y}_{ppes}) = \frac{1}{n}\sum_{i=1}^{N} z_i\left(\frac{y_i}{z_i} - Y\right)^2 = \frac{1}{n}\sum_{i=1}^{N}\frac{(y_i - z_iY)^2}{z_i} \]

This expression becomes zero if $z_i \propto y_i$: that is, $z_i = y_i/Y$. If the $y_i$ are
all positive, this set of $z_i$ is an acceptable set of probabilities. Consequently,
the best measures of size are numbers proportional to the item totals $y_i$
for the units.
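
A two-line check makes the point concrete. In this Python sketch the unit totals `y` are hypothetical; the variance (9.28) is evaluated once with the ideal probabilities $z_i = y_i/Y$ and once with equal probabilities for contrast.

```python
from fractions import Fraction as F

def ppes_variance(y, z, n):
    """(9.28): variance of the ppes estimate of the total."""
    Y = sum(y)
    return sum(zi * (yi / zi - Y) ** 2 for yi, zi in zip(y, z)) / n

y = [F(2), F(5), F(3)]                       # hypothetical unit totals
z_ideal = [yi / sum(y) for yi in y]          # z_i = y_i / Y
z_equal = [F(1, 3)] * 3                      # equal probabilities, for contrast
v_ideal = ppes_variance(y, z_ideal, n=2)
v_equal = ppes_variance(y, z_equal, n=2)
print(v_ideal, v_equal)
```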

This result is not of practical importance, for if the $y_i$ were known in
advance the sample would be unnecessary. The result suggests that if the
$y_i$ are relatively stable through time the most recently available previous
values of the $y_i$ may be the best measures of size for this item. In practice,
of course, a single measure of size must be used for all items in selecting
the sample. If there is a choice between different measures of size, the
measure that is most nearly proportional to the unit totals of the principal
items is likely to be best.
9.12 RELATIVE PRECISIONS OF THE TECHNIQUES


When sampling cluster units of unequal sizes, there is a choice of at
least four techniques (assuming that the $M_i$ are known if the technique
requires them).


1. Selection: equal probabilities. Estimate $\hat{Y}_u$ or $\hat{\bar{\bar{Y}}}_u$ (unbiased).

2. Selection: equal probabilities. Estimate $\hat{Y}_R$ or $\hat{\bar{\bar{Y}}}_R$ (ratio).

3. Selection: probability $\propto$ size. Estimate $\hat{Y}_{pps}$ or $\hat{\bar{\bar{Y}}}_{pps}$.

4. Stratify the units by size. Select with equal probabilities within strata.
Estimate as usual for stratified sampling.


Initially, the first three techniques will be compared. There is no simple
general rule for deciding which is most precise. The issue depends on
the relation (if any) between $\bar{y}_i$ and $M_i$ and on the variance of $\bar{y}_i$ as a
function of $M_i$. The situation most favorable to the ratio and pps estimates
is that in which the mean per element is unrelated to the size of the cluster
($\bar{y}_i$ uncorrelated with $M_i$). In order to include populations in which $\bar{y}_i$ may
decrease or may increase as $M_i$ increases, we adopt a model of the form

\[ \bar{y}_i = \frac{\alpha}{M_i} + \beta + e_i \tag{9.33} \]

where $E(e_i \mid M_i) = 0$.

Some assumption must also be made about the variance of $e_i$ in clusters
of given size. As discussed in section 9.4, the elements in a cluster often
have a positive correlation $\rho$, usually small, which decreases as $M_i$
increases. If we write $\rho = \rho_0 + \rho_1/M_i$, (9.7) in section 9.4 suggests that

\[ V(e_i) = V(\bar{y}_i) = \frac{S^2}{M_i}[1 + (M_i - 1)\rho] = S^2\left(\rho_0 + \frac{1 - \rho_0 + \rho_1}{M_i} - \frac{\rho_1}{M_i^2}\right) \]

As a simplification, we assume $V(e_i) = v/M_i^g$ where g > 0. Since $\rho$
appears to change relatively slowly with $M_i$, it seems likely that g lies
between 0 and 1. There are, however, variables for which the unit total
is unrelated to $M_i$ and for these g = 2 may be appropriate.

The comparisons made from this model are based on work by Yates
(1960), Des Raj (1954, 1958), Zarkovic (1960) and Cochran (1953), who
used similar models.

The comparisons are restricted to surveys in which the fpc is negligible
and n is large enough so that the approximate formula for the variance
of the ratio estimate holds.
From (9.33), it follows that

\[ y_i = \alpha + \beta M_i + e_iM_i, \qquad \bar{Y} = \alpha + \beta\bar{M} \qquad (\bar{M} = \textstyle\sum M_i/N) \]

For estimating the population mean per element, we have from (9.21), on
dividing by $M_0^2 = N^2\bar{M}^2$,

\[ nV(\hat{\bar{\bar{Y}}}_u) = \frac{E(y_i - \bar{Y})^2}{\bar{M}^2} = \frac{E[\beta(M_i - \bar{M}) + e_iM_i]^2}{\bar{M}^2} = \beta^2c^2 + \frac{vE(M_i^{2-g})}{\bar{M}^2} \]

where $c^2 = E(M_i - \bar{M})^2/\bar{M}^2$ is the square of the coefficient of variation
of the sizes $M_i$. For the ratio estimate, from (9.23),

\[ nV(\hat{\bar{\bar{Y}}}_R) = \frac{EM_i^2(\bar{y}_i - \bar{\bar{Y}})^2}{\bar{M}^2} = \frac{EM_i^2[\alpha(1/M_i - 1/\bar{M}) + e_i]^2}{\bar{M}^2} = \frac{\alpha^2c^2}{\bar{M}^2} + \frac{vE(M_i^{2-g})}{\bar{M}^2} \]

From (9.26), for the pps estimate,

\[ nV(\hat{\bar{\bar{Y}}}_{pps}) = \frac{EM_i(\bar{y}_i - \bar{\bar{Y}})^2}{\bar{M}} = \frac{EM_i[\alpha(1/M_i - 1/\bar{M}) + e_i]^2}{\bar{M}} = \frac{\alpha^2}{\bar{M}}\left[E\left(\frac{1}{M_i}\right) - \frac{1}{\bar{M}}\right] + \frac{vE(M_i^{1-g})}{\bar{M}} \]

Table 9.7 shows the results separately for g = 0, 1, 2, with the term
involving v placed first.

TABLE 9.7

COMPARABLE VALUES OF $nV_u$, $nV_R$, AND $nV_{pps}$

g    Equal probability,                  Equal probability,                            pps
     unbiased estimate                   ratio estimate

0    $v(1 + c^2) + c^2\beta^2$           $v(1 + c^2) + c^2\alpha^2/\bar{M}^2$          $v + (\alpha^2/\bar{M})[E(1/M_i) - 1/\bar{M}]$
1    $v/\bar{M} + c^2\beta^2$            $v/\bar{M} + c^2\alpha^2/\bar{M}^2$           $v/\bar{M} + (\alpha^2/\bar{M})[E(1/M_i) - 1/\bar{M}]$
2    $v/\bar{M}^2 + c^2\beta^2$          $v/\bar{M}^2 + c^2\alpha^2/\bar{M}^2$         $(v/\bar{M})E(1/M_i) + (\alpha^2/\bar{M})[E(1/M_i) - 1/\bar{M}]$

Consider first $\alpha = 0$, the case in which $\bar{y}_i$ is unrelated to the size of
cluster. In $nV_R$ and $nV_{pps}$ the second term vanishes. It is clear that

(i) $V_R < V_u$ for g = 0, 1, 2

If $\beta$ (which in this case becomes $\bar{\bar{Y}}$) is large, the superiority of the ratio
estimate may be great. Incidentally, the case g = 1, $\alpha = 0$, applies if
elements are assigned to clusters at random; that is, if the cluster unit
is as efficient as the element. This is also the case in which the ratio
estimate is a best unbiased linear estimate. Further, if $\alpha = 0$,

(ii) $V_{pps} < V_u$ for g = 0, 1

For g = 2, the comparable variances are

\[ V_u = \frac{v}{\bar{M}^2} + c^2\beta^2, \qquad V_{pps} = \frac{v}{\bar{M}}E\left(\frac{1}{M_i}\right) \approx \frac{v(1 + c^2)}{\bar{M}^2} \]

Hence pps sampling wins unless $v/\bar{M}^2 > \beta^2$.

When $\alpha$ is not zero, that is, when $\bar{y}_i$ either decreases or increases as $M_i$
increases, the relative performances of the ratio and pps estimates to the
simple expansion depend on the relative sizes of $\alpha$ and $\beta$. If $\beta = 0$, so
that the unit total $y_i$ is uncorrelated with $M_i$, the unbiased estimate always
beats the ratio estimate and the pps estimate, except possibly when g = 0.

As regards the comparison between $V_R$ and $V_{pps}$, the coefficients of $\alpha^2$
are approximately the same in the two expressions. Hence we have,
roughly,

$V_R > V_{pps}$ if g = 0;
$V_R = V_{pps}$ if g = 1;
$V_R < V_{pps}$ if g = 2
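
The comparable variances are easy to evaluate for any set of cluster sizes. The Python sketch below is a hedged illustration of the three expressions derived from model (9.33); the sizes are borrowed from the seven-unit example of section 9.9, and the values of alpha, beta, and v are arbitrary assumptions. E(.) is taken as a plain average over the listed sizes.

```python
def comparable_variances(M, alpha, beta, v, g):
    """n times the variance of the estimated mean per element under (9.33).

    Returns (unbiased, ratio, pps); E(.) is an average over the sizes in M.
    """
    N = len(M)
    Mbar = sum(M) / N

    def mean(f):
        return sum(f(m) for m in M) / N

    c2 = mean(lambda m: (m - Mbar) ** 2) / Mbar**2      # squared CV of sizes
    v_term = v * mean(lambda m: m ** (2 - g)) / Mbar**2
    unbiased = v_term + beta**2 * c2
    ratio = v_term + alpha**2 * c2 / Mbar**2
    pps = (v * mean(lambda m: m ** (1 - g)) / Mbar
           + alpha**2 / Mbar * (mean(lambda m: 1 / m) - 1 / Mbar))
    return unbiased, ratio, pps

sizes = [3, 1, 11, 6, 4, 2, 3]       # assumed sizes (those of section 9.9)
results = {g: comparable_variances(sizes, alpha=0.0, beta=2.0, v=1.0, g=g)
           for g in (0, 1, 2)}
for g, (u, r, p) in results.items():
    print(g, round(u, 4), round(r, 4), round(p, 4))
```

With alpha = 0 the output shows the ratio estimate beating the unbiased estimate for every g, and the pps estimate beating, equalling, and losing to the ratio estimate for g = 0, 1, and 2 respectively.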
Realistic com 


Since the issue d 
within the stra 


... favorable to stratified sampling), Table 9.8 shows
the values of $nV_{st}$, comparable to those in Table 9.7, for proportional
and Neyman optimum allocation.

The results are as follows. Note that with stratified samples the variance
is unaffected by the value of $\alpha$. If $\alpha = 0$, $V_{prop} = V_R$ for all three values
of g. Further, $V_{prop}$ beats pps sampling for g = 2, equals it for g = 1,
and is inferior for g = 0. If $\alpha$ differs substantially from zero, stratified
sampling is superior to ratio and pps sampling. If optimum allocation can
be achieved, stratified sampling is never inferior and nearly always superior
to the other methods (assuming $M_i$ constant within strata).


TABLE 9.8 
COMPARABLE VALUES OF $nV_{st}$


Proportional    Optimum


$v(1 + c^2)$


To summarize, if $\bar{y}_i$ shows no trend or only a slight trend as $M_i$ increases,
the ratio and pps methods are more precise than unbiased estimation with
equal probabilities and may be much more precise. The unbiased estimate
is superior if the unit total $y_i$ is uncorrelated with $M_i$. There is less to
choose between the ratio and the pps estimates. Since g is expected to lie
mostly between 0 and 1, the pps estimate is probably more precise on the
whole. On the other hand, the ratio estimate is easier to compute and
less expensive if it costs more to obtain data from a large unit than from
a small unit, since pps sampling tends to concentrate on the larger units.
If strata can be constructed within which the $M_i$ vary little, stratification
performs well, particularly if optimum allocation is feasible. One
advantage of the ratio and pps methods is that stratification can be used for
some other purpose.


9.13 EXTENSION TO STRATIFIED SAMPLING 


Selection within strata with probability proportional to an estimate 
of size is likely to be useful when a stratification has been made by some 
variable other than size. If the samples within each stratum are small 
and the total sample is not large, we have seen (section 6.10) that the 
available variance formulas for ratio estimates are somewhat suspect and 
that the separate ratio estimate may have a non-negligible bias. 
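
With ppes sampling, the estimated population total is the sum of the
estimates from the separate strata.

\[ \hat{Y}_{ppes} = \sum_h \hat{Y}_h = \sum_h \frac{1}{n_h}\sum_{i=1}^{n_h}\frac{y_{hi}}{z_{hi}} \]

From theorems 9.3 and 9.5, we obtain

\[ V(\hat{Y}_{ppes}) = \sum_h \frac{1}{n_h}\sum_{i=1}^{N_h} z_{hi}\left(\frac{y_{hi}}{z_{hi}} - Y_h\right)^2 \]

\[ v(\hat{Y}_{ppes}) = \sum_h \frac{1}{n_h(n_h - 1)}\sum_{i=1}^{n_h}\left(\frac{y_{hi}}{z_{hi}} - \hat{Y}_h\right)^2 \]

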
9.14 SAMPLING WITH UNEQUAL PROBABILITIES
WITHOUT REPLACEMENT


Much interesting work has been done on methods of selecting samples
with unequal probabilities but without replacement. In practice, this
issue arises mainly in multistage stratified sampling (Chapter 11) in which
large cluster units constitute the first stage of sampling. Stratification of
the large units may be carried to the point at which the strata contain only
a small number of units, so that the first-stage sampling fractions $n_h/N_h$
are not negligible. However, most of the methods were developed first
for single-stage sampling, in which the algebra is simpler.

Suppose that two units are to be drawn from a stratum. The first unit
is drawn with probability proportional to size. Let the ith unit be selected
at the first draw and let its relative size be $z_i$, where $\sum z_i = 1$. At the second
draw one of the remaining units is selected with probability proportional
to relative size, that is, with probability $z_j/(1 - z_i)$ for the jth unit. Hence
the total probability that the ith unit will be selected at either the first or
second draw is

\[ \pi_i = z_i + \sum_{j \ne i}\frac{z_jz_i}{1 - z_j} \tag{9.34} \]

\[ = z_i\left[1 + \sum_{j=1}^{N}\frac{z_j}{1 - z_j} - \frac{z_i}{1 - z_i}\right] \]

\[ = z_i\left[1 + A - \frac{z_i}{1 - z_i}\right] \tag{9.35} \]

where $A = \sum z_j/(1 - z_j)$ taken over all N units.

The expected number of units to be drawn is $\sum\pi_i$. Since two units are
certain to be drawn by this process, we must have $\sum\pi_i = 2$, as is
easily verified algebraically. Thus the relative probability that the ith
unit will appear in the sample is $\pi_i/2 = z_i'$ (say). With this method of
drawing, the $z_i'$ are always closer to equality than the $z_i$. In the example
given by Yates and Grundy (1953), with N = 4, $z_i$ = 0.1, 0.2, 0.3, and
0.4, the $z_i'$ are found to be 0.1173, 0.2206, 0.3042, and 0.3579. The
distortion of the probabilities is not great, considering that half the units are
selected.
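
The calculation in (9.35) takes only a few lines. This Python sketch (the function name is an assumption made for the illustration) applies it to the Yates and Grundy example quoted above and recovers the $z_i'$ given in the text, together with the check that the $\pi_i$ add to 2.

```python
def inclusion_probabilities(z):
    """pi_i from (9.35) for n = 2: first draw pps, second pps among the rest."""
    A = sum(zj / (1 - zj) for zj in z)
    return [zi * (1 + A - zi / (1 - zi)) for zi in z]

z = [0.1, 0.2, 0.3, 0.4]                 # Yates and Grundy (1953) example
pi = inclusion_probabilities(z)
z_prime = [p / 2 for p in pi]            # relative probabilities z_i' = pi_i / 2
print([round(v, 4) for v in z_prime], round(sum(pi), 10))
```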

Suppose now that a sample of n units is selected, without replacement,
by an extension of this method or by some other method. Let

$\pi_i$ = probability that the ith unit is in the sample
$\pi_{ij}$ = probability that the ith and jth units are both in the sample

The following relations hold:

\[ \sum_{i}^{N}\pi_i = n, \qquad \sum_{j \ne i}^{N}\pi_{ij} = (n - 1)\pi_i, \qquad \sum_{i}^{N}\sum_{j > i}^{N}\pi_{ij} = \tfrac{1}{2}n(n - 1) \tag{9.36} \]

To establish the second relation, let P(s) denote the probability of a sample
consisting of n specified units. Then $\pi_{ij} = \sum P(s)$ over all samples
containing the ith and jth units, and $\pi_i = \sum P(s)$ over all samples containing the
ith unit. When we take $\sum\pi_{ij}$ for $j \ne i$, every P(s) for a sample containing
the ith unit is counted (n - 1) times in the sum, since there are (n - 1)
other values of j in the sample. This proves the second relation. The
third relation follows from the second.

We now show how to obtain an unbiased estimate of the stratum total
Y and its variance and estimated variance. If $z_i' = \pi_i/n$, the estimate is

\[ \hat{Y}_u = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{z_i'} \tag{9.37} \]

where $y_i$ is the measurement for the ith unit. Let $t_i$ (i = 1, 2, ..., N) be
a random variable which takes the value 1 if the ith unit is drawn and zero
otherwise. Then $t_i$ follows the binomial distribution for a sample of size
1, with probability $\pi_i$. Thus

\[ E(t_i) = \pi_i = nz_i', \qquad V(t_i) = \pi_i(1 - \pi_i) \]

The value of $\mathrm{Cov}(t_it_j)$ is also required. Since $t_it_j$ is 1 only if both units
appear in the sample,

\[ \mathrm{Cov}(t_it_j) = E(t_it_j) - E(t_i)E(t_j) = \pi_{ij} - \pi_i\pi_j \]

Hence, regarding the $y_i$ as fixed and the $t_i$ as random variables,

\[ E(\hat{Y}_u) = \frac{1}{n}\sum_{i=1}^{N} E(t_i)\frac{y_i}{z_i'} = \sum_{i=1}^{N} y_i = Y \]

\[ V(\hat{Y}_u) = \frac{1}{n^2}\left[\sum_{i=1}^{N}\left(\frac{y_i}{z_i'}\right)^2 V(t_i) + 2\sum_{i=1}^{N}\sum_{j>i}\frac{y_i}{z_i'}\frac{y_j}{z_j'}\mathrm{Cov}(t_it_j)\right] \]

\[ = \frac{1}{n^2}\left[\sum_{i=1}^{N}\pi_i(1 - \pi_i)\left(\frac{y_i}{z_i'}\right)^2 + 2\sum_{i=1}^{N}\sum_{j>i}(\pi_{ij} - \pi_i\pi_j)\frac{y_i}{z_i'}\frac{y_j}{z_j'}\right] \tag{9.38} \]

These results were given by Horvitz and Thompson (1952). Another
expression for the variance can be obtained by using the first two of the
relations in (9.36). These give

\[ \sum_{j \ne i}(\pi_{ij} - \pi_i\pi_j) = (n - 1)\pi_i - \pi_i(n - \pi_i) = -\pi_i(1 - \pi_i) \]

Hence, substituting for $\pi_i(1 - \pi_i)$ in (9.38),

\[ V(\hat{Y}_u) = \frac{1}{n^2}\sum_{i=1}^{N}\sum_{j>i}(\pi_i\pi_j - \pi_{ij})\left[\left(\frac{y_i}{z_i'}\right)^2 + \left(\frac{y_j}{z_j'}\right)^2 - 2\frac{y_i}{z_i'}\frac{y_j}{z_j'}\right] \]

This may be expressed as

\[ V(\hat{Y}_u) = \frac{1}{n^2}\sum_{i=1}^{N}\sum_{j>i}(\pi_i\pi_j - \pi_{ij})\left(\frac{y_i}{z_i'} - \frac{y_j}{z_j'}\right)^2 \tag{9.39} \]

It follows that an unbiased sample estimate of this variance is

\[ v(\hat{Y}_u) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j>i}\frac{(\pi_i\pi_j - \pi_{ij})}{\pi_{ij}}\left(\frac{y_i}{z_i'} - \frac{y_j}{z_j'}\right)^2 \tag{9.40} \]

provided that $\pi_{ij}$ does not vanish for any pair of units. This estimate is
due to Yates and Grundy (1953).

These equations supply a sampling theory for selection without
replacement. For practical application, there are difficulties. As n increases, it
becomes harder with any method of selection to keep the $z_i'$ close to the
original $z_i$. The quantities $\pi_i$ and $\pi_{ij}$ become complicated to calculate.
The estimated variance (9.40) tends to be an unstable quantity because
the terms $(\pi_i\pi_j - \pi_{ij})/\pi_{ij}$ vary widely, being sometimes negative. Some
ingenious approaches that attempt to surmount these difficulties are
described in the next section.
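
For n = 2 these results can be verified by listing every possible sample. The Python sketch below uses the two-draw scheme of (9.34) with the relative sizes of the Yates and Grundy example; the unit values `y` are hypothetical. It compares the variance computed directly over all samples with formula (9.39), and also averages the estimator (9.40) over all samples.

```python
from itertools import combinations

z = [0.1, 0.2, 0.3, 0.4]          # relative sizes (Yates and Grundy example)
y = [0.5, 1.2, 2.1, 3.2]          # hypothetical unit values, for illustration
N, n = len(z), 2
Y = sum(y)

# Joint inclusion probabilities for the two-draw scheme, then pi_i from the
# second relation in (9.36), which for n = 2 is the sum of pi_ij over j.
pi_ij = {(i, j): z[i] * z[j] * (1 / (1 - z[i]) + 1 / (1 - z[j]))
         for i, j in combinations(range(N), 2)}
pi = [sum(p for pair, p in pi_ij.items() if i in pair) for i in range(N)]

def estimate(i, j):
    """(9.37) with z_i' = pi_i / n, that is, the sum of y_i / pi_i."""
    return y[i] / pi[i] + y[j] / pi[j]

# Direct mean and variance over all samples; P(s) = pi_ij when n = 2.
mean = sum(p * estimate(i, j) for (i, j), p in pi_ij.items())
direct = sum(p * (estimate(i, j) - Y) ** 2 for (i, j), p in pi_ij.items())

# Variance from (9.39).
formula = sum((pi[i] * pi[j] - pi_ij[i, j]) * (y[i] / pi[i] - y[j] / pi[j]) ** 2
              for i, j in combinations(range(N), 2))

# Expected value of the Yates-Grundy estimator (9.40) over all samples.
expected_v = sum(p * (pi[i] * pi[j] - p) / p * (y[i] / pi[i] - y[j] / pi[j]) ** 2
                 for (i, j), p in pi_ij.items())
print(mean, direct, formula, expected_v)
```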


9.15 ALTERNATIVE APPROACHES 

Narain (1951) constructed original probabilities of selection such that
the final probabilities are proportional to the sizes. For example, consider
n = 2. If we want $\pi_i = 2z_i$, (9.35) in section 9.14 shows that the original
probabilities $Z_i$ must satisfy the equations

\[ Z_i\left(1 + \sum_{j=1}^{N}\frac{Z_j}{1 - Z_j} - \frac{Z_i}{1 - Z_i}\right) = 2z_i \]

Methods for solving these equations are given by Narain and by Yates
and Grundy. The computations are tedious for n > 2, and, as with all
approaches, the method ultimately breaks down if n is large enough.
Murthy (1957), following work by Des Raj (1956a), uses as an estimate
the weighted expression

\[ \hat{Y}_M = \frac{\sum_{i=1}^{n} P(s \mid i)\,y_i}{P(s)} \]

where P(s | i) = conditional probability of getting the set of units that
was drawn, given that the ith unit was drawn first;
P(s) = unconditional probability of getting the set of units that
was drawn.

This method applies to any sampling plan in which the probability of
drawing the remaining units in the sample does not depend on the order
in which previous units were drawn, although it may, of course, depend
on the sizes of the particular units. Under these conditions the estimate
is unbiased, and general expressions for its variance and estimated variance
have been given by Murthy. When n = 2, this plan has the advantage
that the estimate of variance is always positive. The estimate then becomes

\[ \hat{Y}_M = \frac{1}{2 - z_i - z_j}\left[(1 - z_j)\frac{y_i}{z_i} + (1 - z_i)\frac{y_j}{z_j}\right] \]

with estimated variance

\[ v(\hat{Y}_M) = \frac{(1 - z_i)(1 - z_j)(1 - z_i - z_j)}{(2 - z_i - z_j)^2}\left(\frac{y_i}{z_i} - \frac{y_j}{z_j}\right)^2 \]
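
A small enumeration illustrates the n = 2 formulas. In this Python sketch the population values are hypothetical, and the sampling plan is assumed to be the one of section 9.14 (first unit with probability $z_i$, second with probability proportional to the remaining sizes). The loop runs over every ordered pair of distinct units, checks that the estimated variance is never negative, and confirms that the estimate averages to Y.

```python
from fractions import Fraction as F
from itertools import permutations

def murthy_n2(yi, zi, yj, zj):
    """Murthy's estimate of the total and its estimated variance for n = 2."""
    d = 2 - zi - zj
    est = ((1 - zj) * yi / zi + (1 - zi) * yj / zj) / d
    var = (1 - zi) * (1 - zj) * (1 - zi - zj) / d**2 * (yi / zi - yj / zj) ** 2
    return est, var

# Hypothetical population (values invented for the illustration).
y = [F(1), F(3), F(2), F(6)]
z = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]

expected = F(0)
all_nonnegative = True
for i, j in permutations(range(len(y)), 2):
    p_ordered = z[i] * z[j] / (1 - z[i])      # unit i first, then unit j
    est, var = murthy_n2(y[i], z[i], y[j], z[j])
    all_nonnegative = all_nonnegative and var >= 0
    expected += p_ordered * est
print(expected == sum(y), all_nonnegative)
```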


Des Raj's technique (1956b) assumes that we know the values of an
auxiliary variate $x_i$ (which may be the sizes), such that the relation between
$y_i$ and $x_i$ is linear. By methods of linear programming, he finds, for n = 2,
the values of the $\pi_{ij}$ such that $\pi_i \propto x_i$ and that $V(\hat{Y}_u)$, as given in (9.38),
is minimized.

The two remaining methods are applications of techniques already
discussed. The first, due to Hartley and Rao (1962), is to arrange the
units in random order, cumulate the sizes, and draw an "every kth"
systematic sample from the cumulated sizes. If a sample of n units is
wanted, we take $k = M_0/n$, draw a random number r between 1 and k,
and select the units that contain the numbers r, r + k, r + 2k, etc., in the
cumulated sizes. If any unit is larger than $M_0/n$, it has a chance of being
selected twice, but otherwise this plan selects with probabilities
proportional to the original sizes. The average of the $y_i/z_i$ is an unbiased
estimate of Y.
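
The systematic selection can be sketched as follows in Python. The function name, the requirement that n divide the total size exactly, and the use of the section 9.9 sizes in their listed order are all simplifying assumptions for the illustration; in the method as described the units would first be put in random order.

```python
import itertools

def systematic_pps(sizes, n, r):
    """Select n units by an every-kth systematic sample from cumulated sizes.

    sizes are taken in the order given (shuffle beforehand for a random order);
    r is the random start, 1 <= r <= k, with k = M0 / n assumed to be an integer.
    Returns 1-based unit labels; a unit larger than k can appear twice.
    """
    cumulative = list(itertools.accumulate(sizes))
    M0 = cumulative[-1]
    if M0 % n != 0:
        raise ValueError("this sketch assumes that n divides the total size")
    k = M0 // n
    if not 1 <= r <= k:
        raise ValueError("r must lie between 1 and k")
    chosen = []
    for number in range(r, M0 + 1, k):
        # First unit whose cumulated size reaches the selected number.
        unit = next(i for i, c in enumerate(cumulative) if number <= c)
        chosen.append(unit + 1)
    return chosen

sizes = [3, 1, 11, 6, 4, 2, 3]            # sizes from section 9.9, M0 = 30
picked = systematic_pps(sizes, n=2, r=4)  # k = 15, so numbers 4 and 19 are used
print(picked)
```

Running the function for every start r from 1 to k and counting how often each unit appears shows selection frequencies proportional to the sizes, as the text states.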

Hartley and Rao give expressions for the variance and estimated
variance for this plan in an expansion in inverse powers of N. Its efficiency
appears to be similar to that of Narain's method, in which the probabilities
also remain proportional to the original sizes. The systematic method
avoids the computation of new original probabilities of selection.

Finally, we may subdivide the population into n groups and select one
unit from each group with probabilities proportional to relative size
within the group, as described in section 9.9. If the ith unit happens to
fall in the first group, its probability of selection is $z_i/Z_1$, where $Z_1 = \sum z_i$
taken over the units in group 1. Consequently, in order to preserve the
property of selection with probability proportional to size, the groups
should be formed so that as nearly as possible $Z_1 = Z_2 = Z_3$, etc. An
unbiased estimate of Y is

\[ \hat{Y}_G = \sum_{j=1}^{n} Z_j\frac{y_j}{z_j} \]


where $y_j$, $z_j$ are the value and the size for the unit drawn from group j.
No unbiased estimate of variance is known, but an overestimate can be
obtained by the method of ...

... < n, we make k of the groups contain (Q + 1) ...
... ing (n - k) groups have Q units each. The estimate ...
... portional to $z_i$. Its advantages, in ...
operation, are that explicit expressions are ...
... s. Rao, Hartley, and Cochran (1962) have
shown that

\[ V(\hat{Y}_{RG}) = \left[\left(1 - \frac{n - 1}{N - 1}\right) + \frac{k(n - k)}{N(N - 1)}\right]V(\hat{Y}_{ppes}) \]

where $\hat{Y}_{ppes}$ is the estimate (section 9.10) for sampling with probabilities
proportional to $z_i$ and with replacement. The first term in parentheses
plays the role of an fpc. The expression giving an unbiased estimate of
variance is

\[ v(\hat{Y}_{RG}) = \frac{N^2 + k(n - k) - Nn}{N^2(n - 1) - k(n - k)}\sum_{j=1}^{n} Z_j\left(\frac{y_j}{z_j} - \hat{Y}_{RG}\right)^2 \]


9.16 SOME COMPARISONS FOR n=2 


The case n = 2 is likely to be the most frequent as well as the simplest.
In the choice of a method relevant factors are (a) the ease with which the
sample can be drawn, (b) the simplicity of the estimate, (c) the accuracy
of the estimate, and (d) the availability of an estimate of the variance of
the estimate.

No extensive comparisons of the performances of the methods have
been made, although a number of them have been applied to three small
populations with N = 4, n = 2, constructed by Yates and Grundy (1953).
Six methods are compared here on three populations with N = 5, n = 2
constructed as follows. The sizes $z_i$ of the units are the same in all three
populations (A, B, C). In A the mean per element, which is proportional
to $y_i/z_i$, is uncorrelated with $z_i$. In B the mean per element rises as the
sizes increase, and in C the mean per element decreases as the sizes
increase.


TABLE 9.9
THREE SMALL ARTIFICIAL POPULATIONS

Relative sizes ($z_i$)          0.1    0.1    0.2    0.3    0.3

Population A    $y_i$           0.3    0.5    0.8    0.9    1.5
                $y_i/z_i$       3      5      4      3      5

Population B    $y_i$           0.3    0.3    0.8    1.5    1.5
                $y_i/z_i$       3      3      4      5      5

Population C    $y_i$           0.5    0.5    0.8    0.9    0.9
                $y_i/z_i$       5      5      4      3      3


The plans compared are as follows. All give unbiased estimates. 


1. The first unit is selected with probability proportional to $z_i$, the
second with probability proportional to remaining sizes. Estimate:
$\hat{Y}_{HT} = \sum y_i/\pi_i$.

2. The original probabilities are chosen, as proposed by Narain, so
that $\pi_i = 2z_i$. Estimate: $\hat{Y}_N = \tfrac{1}{2} \sum y_i/z_i$.

3. Units ordered at random and a systematic sample drawn. Estimate:
$\hat{Y}_{sys} = \tfrac{1}{2} \sum y_i/z_i$.

4. Population divided into two groups of equal total sizes. One group
comprises units of sizes 0.1, 0.1, 0.3; the other, units of sizes 0.2, 0.3.
(It makes no difference which large unit is placed with the small units.)
Estimate: $\hat{Y}_{G1} = \sum Z_j (y_j/z_j)$.

5. Units arranged at random into one group of three units and one of
two units. Estimate: $\hat{Y}_{G2} = \sum Z_j (y_j/z_j)$.

6. Units selected with pps and with replacement. Estimate:
$\hat{Y}_{ppz} = \tfrac{1}{2} \sum y_i/z_i$.

Table 9.10 presents the variances. On the average, there is little to
choose between the first five methods, all being superior to sampling with
replacement. The average variances may be misleading because population
A may be more typical of the situations in which pps sampling is used
than B and C, although cases in which $y_i/z_i$ is correlated with $z_i$ do occur.


TABLE 9.10
VARIANCES FOR THE ESTIMATED POPULATION TOTAL

                                        Estimate
Population   $\hat{Y}_{HT}$   $\hat{Y}_N$   $\hat{Y}_{sys}$   $\hat{Y}_{G1}$   $\hat{Y}_{G2}$   $\hat{Y}_{ppz}$
A            0.279    0.244    0.233    0.220    0.320    0.400
B            0.434    0.252    0.273    0.300    0.256    0.320
C            0.120    0.252    0.273    0.300    0.256    0.320
Average      0.278    0.249    0.260    0.273    0.277    0.347
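
The last two columns of Table 9.10 can be checked from Table 9.9. The sketch below is illustrative only: the function names are invented here, the with-replacement variance is taken to be $\frac{1}{n} \sum z_i (y_i/z_i - Y)^2$ as in section 9.10, and the random-groups variance is assumed to be that variance multiplied by the Rao, Hartley, and Cochran factor of section 9.15 with N = 5 = 2(2) + 1, so that k = 1.

```python
# Sizes and y-values copied from Table 9.9.
z = [0.1, 0.1, 0.2, 0.3, 0.3]
populations = {
    "A": [0.3, 0.5, 0.8, 0.9, 1.5],
    "B": [0.3, 0.3, 0.8, 1.5, 1.5],
    "C": [0.5, 0.5, 0.8, 0.9, 0.9],
}

def var_ppz(y, z, n):
    """Variance of the estimated total, pps sampling with replacement."""
    total = sum(y)
    return sum(zi * (yi / zi - total) ** 2 for yi, zi in zip(y, z)) / n

def rhc_factor(N, n):
    """Assumed multiplier of V(ppz) for random groups (N = nQ + k, k < n)."""
    k = N % n
    return 1 - (n - 1) / (N - 1) + k * (n - k) / (N * (N - 1))

for name, y in populations.items():
    v_ppz = var_ppz(y, z, n=2)
    v_g2 = rhc_factor(N=5, n=2) * v_ppz
    print(name, round(v_ppz, 3), round(v_g2, 3))
```

Under these assumptions the loop reproduces the last two columns of the table for each population.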


In population A the two plans ($\hat{Y}_{HT}$ and $\hat{Y}_{G2}$), which distort the
probabilities of selection, appear less accurate than the three that preserve the
probabilities.

Unbiased estimates of error are available by (9.40) for $\hat{Y}_{HT}$ and $\hat{Y}_N$,
with the drawbacks that the $\pi_{ij}$ must be computed and the estimate may
be rather erratic. The estimate $\hat{Y}_{G2}$ (subdivision into random groups)
is the most favorably situated regarding the estimate of error.



EXERCISES 

9.1 For the data in Table 9.1 compare the relative costs of using the four
types of unit when the object is to estimate the total number of seedlings in the
bed with a standard error of 200 seedlings. (Note that the fpc is involved.)


9.2 For the data in Table 3.5 (p. 66) estimate the relative precision of the
household to the individual for estimating the sex ratio and the proportion of
people who had seen a doctor in the past 12 months, assuming simple random
sampling.

9.3 A population consisting of 2500 elements is divided into 10 strata, each
containing 50 large units composed of five elements. The analysis of variance
of the population for an item is as follows, on an element basis:


                                          df      ms
Between strata                             9    30.6
Between large units within strata        490     3.0
Between elements within large units     2000     1.6

Ignoring the fpc, is the relative precision of the large to the small unit greater
with simple random sampling than with stratified random sampling (proportional
allocation)?


9.4 A population containing LNM elements is divided into L strata, each
having N large units, each of which contains M small units. The following
quantities come from the analysis of variance of the population, on an element
basis:

$S_1^2$ = mean square between strata
$S_2^2$ = mean square between large units within strata
$S_3^2$ = mean square between elements within strata

If N is large and the fpc is ignored, show that the relative precision of the large
to the small unit (element) is improved by stratification if
(M—-1) M 1 
S? Sf Ss 
9.5 In a rural survey in which the sampling unit is a cluster of M farms, the
cost of taking a sample of n units is

$$C = 4tMn + 60\sqrt{n}$$

where t is the time in hours spent getting the answers from a single farmer. If
$2000 is spent on the survey, the values of n for M = 1, 5, 10; t = 1/2, 2, work out
as follows.

                       M
                1      5     10
t = 1/2 hr    400    131     75
t = 2 hr      153     40     21

Verify two of these values to ensure that you understand the use of the formula.
The variance of the sample mean (ignoring the fpc) is

$$\frac{S^2}{nM}\,[1 + (M - 1)\rho]$$

If ρ = 0.1 for all M between 1 and 10, which size of unit is most precise for (a)
t = 1/2 hr, (b) t = 2 hr? How do you explain the difference in results?


9.6 If $5000 were available for the survey, would you expect the optimum size
of unit to decrease or increase (relative to that for $2000)? Give reasons. You
may, if you wish, find the optimum size in order to check your argument.

9.7 Horvitz and Thompson (1952) give the following data for eye estimates
$M_i$ of the numbers of households and for the actual numbers $y_i$ in 20 city blocks
in Ames, Iowa. To assist in the calculations, values of $y_i^2$ and $y_i^2/M_i$ are also
given. A sample of n = 1 block is chosen. Compute the variances of the total
number of households Y, as obtained by (a) the unbiased estimate in sampling
with equal probabilities, (b) the ratio estimate in sampling with equal probabilities,
(c) sampling with probability proportional to $M_i$. (For the ratio estimate,
compute the true mean square error, not the approximate formula.)
Do the results agree with the discussion in section 9.12?


9.8 A questionnaire is to be sent to a sample of high schools to find out
which schools provide certain facilities, for example, a course in Russian or a
swimming pool. If $M_i$ is the number of students in the ith school, the quantity
to be estimated for any given facility is the proportion P of high-school students
who are in schools having the facility, that is,

$$P = \sum_{w} M_i \Big/ \sum_{i=1}^{N} M_i$$

where $\sum_{w}$ is a sum over those schools with the facility.

A sample of n schools is drawn with probability proportional to $M_i$ with
replacement. For one facility, a schools out of n are found to possess it. (a)
Show that p = a/n is an unbiased estimate of P and that its true variance is
P(1 − P)/n. (Hint. In the corollary to theorem 9.4 let $y_i = M_i$ if the school has
the facility and 0 otherwise.) (b) Show that an unbiased estimate of V(p) is
v(p) = p(1 − p)/(n − 1).


seats s in which the sizes have approximately the 
same distribution as i 1 
fixed sample size? 


| à nal to the Temaining sizes. (a) In the notation 
of section 9.14 verify that z, = 51 


—-4hae-44. —2 E 2-40 

60 72 = gj, T4 = $$ and that 745 = $0» 713 6 
7m = dv. (b) Compare the variances of $\hat{Y}_{HT}$, $\hat{Y}_{sys}$, and $\hat{Y}_{G2}$ as defined
in section 9.16. (For $\hat{Y}_{HT}$ and $\hat{Y}_{G2}$, either construct all possible estimates or use
the variance formulas. For $\hat{Y}_{sys}$, construct all possible estimates.)
CHAPTER 10


Subsampling with Units 
of Equal Size 


10.1 TWO-STAGE SAMPLING 


Suppose that each unit in the population can be divided into a number
of smaller units, or elements. A sample of n units has been selected. If
elements within a selected unit give similar results, it seems uneconomical
to measure them all. A common practice is to select and measure a sample
of the elements in any chosen unit. This technique is called subsampling,
since the unit is not measured completely but is itself sampled. Another
name, due to Mahalanobis, is two-stage sampling, because the sample is
taken in two steps. The first is to select a sample of units, often called
the primary units, and the second is to select a sample of elements from
each chosen primary unit.


Subsampling has a great variety of applications, which go far beyond
the immediate scope of


chemical, physical 


r 
to be drawn as a subsample from a large 
e. 


In this chapter we consider the sim 
the same number M of elements, o 


The principal advantage of two-stage sampling is that it is more flexible
than one-stage sampling. It reduces to one-stage sampling when m = M,
but, unless this is the best choice for m, we have the opportunity of taking
some smaller value that appears more efficient. As usual, the issue reduces
to a balance between statistical precision and cost. When elements in the
same unit agree very closely, considerations of precision suggest a small
value of m. On the other hand, it is sometimes almost as cheap to measure
the whole of a unit as to subsample it; for example, when the unit is a
household and a single respondent can give accurate data about all
members of the household.

Fig. 10.1 Schematic representation of two-stage sampling (N = 81, n = 5, M = 9,
m = 2). × denotes an element in the sample.


10.2 TWO USEFUL RESULTS 


In two-stage sampling expected values must be found not only over all
possible samples of n primary units but also over all possible subsamples
that can be drawn from the selected set of primary units. Fortunately,
there is a close relation between variances in two-stage sampling and the
corresponding variances already obtained for one-stage sampling. Two
general results, due to Durbin (1953), will be proved.

If each primary unit contains M subunits, of which m are chosen, the
simplest estimates of the population total and mean per subunit are,
respectively,

$$\hat{Y} = \frac{NM}{n} \sum_{i=1}^{n} \bar{y}_i, \qquad \bar{\bar{y}} = \frac{1}{n} \sum_{i=1}^{n} \bar{y}_i$$

where $\bar{y}_i$ is the sample mean per subunit in the ith primary unit.
Both estimates are of the form

$$y' = \sum_{i=1}^{n} y_i'$$

where $y_i'$ is an estimate made from the subsample drawn from the ith
primary unit. Let

$$Y_i' = E(y_i' \mid i)$$

where the symbol $E(\,\cdot \mid i)$ denotes a mean taken over all subsamples drawn
from the ith primary unit. If these means were known, we could construct
the estimate

$$\hat{Y}' = \sum_{i=1}^{n} Y_i'$$

This is the one-stage analogue of y'.


The two theorems to be proved apply to primary units of unequal sizes
as well as to those of equal sizes. They also apply when primary units
are selected with unequal probability. The symbol $\pi_i$ denotes the
probability that the ith primary unit is drawn in the sample.


Theorem 10.1. If the primary units are drawn without replacement,
and subsamples are chosen independently in different units, y' is an
unbiased estimate of

$$Y' = \sum_{i=1}^{N} \pi_i Y_i'$$

with variance

$$V(y') = V(\hat{Y}') + \sum_{i=1}^{N} \pi_i \sigma_{2i}'^2$$

where

$$\sigma_{2i}'^2 = E[(y_i' - Y_i')^2 \mid i]$$

is the variance of $y_i'$ in repeated subsampling from the ith unit.

Proof. To find E(y'), average first over all samples that contain the
same set of primary units. This average is denoted by E(y' | pu). Clearly,

$$E(y' \mid pu) = Y_1' + Y_2' + \cdots + Y_n' = \hat{Y}'$$

When we average further over all selections of the n primary units, the
term $Y_i'$ will appear with relative frequency $\pi_i$. Hence

$$E(y') = \sum_{i=1}^{N} \pi_i Y_i' = Y'$$

For the variance, we have by definition

$$V(y') = E(y'^2) - [E(y')]^2 = E\left(\sum_{i}^{n} y_i'^2 + 2 \sum_{i}^{n} \sum_{j>i}^{n} y_i' y_j'\right) - [E(y')]^2 \qquad (10.2)$$

Average first over samples containing the same set of n primary units.
Now

$$E(y_i'^2 \mid i) = Y_i'^2 + \sigma_{2i}'^2 \qquad (10.3)$$

Further, if subsampling is independent in different units,

$$E(y_i' y_j' \mid i, j) = Y_i' Y_j' \qquad (10.4)$$

Hence, substituting in (10.2) and putting $\hat{Y}'$ for E(y' | pu),

$$V(y' \mid pu) = \sum_{i}^{n} Y_i'^2 + 2 \sum_{i}^{n} \sum_{j>i}^{n} Y_i' Y_j' + \sum_{i}^{n} \sigma_{2i}'^2 - [E(\hat{Y}')]^2 \qquad (10.5)$$

The conditional variance may be rewritten as

$$V(y' \mid pu) = \hat{Y}'^2 - [E(\hat{Y}')]^2 + \sum_{i}^{n} \sigma_{2i}'^2$$

Now average over all selections of the primary units. This gives

$$V(y') = V(\hat{Y}') + \sum_{i=1}^{N} \pi_i \sigma_{2i}'^2$$

This proves the theorem.


This result may be phrased as follows. In two-stage sampling the
variance of an estimate of the form y' consists of two parts. The first,
$V(\hat{Y}')$, is the variance obtained by replacing the estimate $y_i'$ from the
subsample in the ith unit by its mean $Y_i'$. This is the between-primary
unit component of the variance. The second, $\sum \pi_i \sigma_{2i}'^2$, is the sum of the
within-unit variances of the $y_i'$, each weighted by its probability of selection
in the sample.


Corollary 1. Theorem 10.1 is a generalization to two-stage sampling of
the result given in (9.38) of section 9.14 for the variance of the estimate
when units are chosen with arbitrary probabilities and without replacement.
Revert to (10.5), and average over selections of the primary units.
If $\pi_{ij}$ is the probability that a sample contains both the ith and the jth
units, this average may be written

$$V(y') = \sum_{i}^{N} \pi_i Y_i'^2 + 2 \sum_{i}^{N} \sum_{j>i}^{N} \pi_{ij} Y_i' Y_j' + \sum_{i}^{N} \pi_i \sigma_{2i}'^2 - \left(\sum_{i}^{N} \pi_i Y_i'\right)^2$$

$$= \left[\sum_{i}^{N} \pi_i (1 - \pi_i) Y_i'^2 + 2 \sum_{i}^{N} \sum_{j>i}^{N} (\pi_{ij} - \pi_i \pi_j) Y_i' Y_j'\right] + \sum_{i}^{N} \pi_i \sigma_{2i}'^2 \qquad (10.6)$$

The between-units component reduces to (9.38) if we take $Y_i' = y_i/\pi_i$.


Corollary 2. The reader may verify that theorem 10.1 also holds when
primary units are drawn with replacement, provided that, if a primary unit
is drawn more than once, the subsample is selected independently from
the whole unit on each occasion. This condition guarantees that equation
(10.4) in the proof remains valid.

Theorem 10.2 supplies a sample estimate of V(y'), given that we have
an unbiased sample estimate $v(\hat{Y}')$ of $V(\hat{Y}')$. In its most general terms,
the estimate $v(\hat{Y}')$ will be a quadratic of the form

$$v(\hat{Y}') = \sum_{i}^{n} a_{ij\ldots} Y_i'^2 + 2 \sum_{i}^{n} \sum_{j>i}^{n} b_{ij\ldots} Y_i' Y_j'$$

where the subscripts in a denote the fact that in complex sampling the
coefficient of $Y_i'^2$ may depend on the other units that are in the sample
with the ith unit, and similarly for the subscripts in b.

Let $v_c(y')$ be a "copy" of $v(\hat{Y}')$, obtained by replacing $Y_i'$ by $y_i'$ wherever
$Y_i'$ appears; that is,

$$v_c(y') = \sum_{i}^{n} a_{ij\ldots} y_i'^2 + 2 \sum_{i}^{n} \sum_{j>i}^{n} b_{ij\ldots} y_i' y_j'$$

Theorem 10.2. Under the conditions of theorem 10.1, an unbiased
estimate of V(y') is

$$v(y') = v_c(y') + \sum_{i}^{n} \pi_i \hat{\sigma}_{2i}'^2 \qquad (10.7)$$

where $\hat{\sigma}_{2i}'^2$ is any unbiased sample estimate of $\sigma_{2i}'^2$.


Proof. From the definition of $v_c(y')$ and relations (10.3) and (10.4), we
have

$$E[v_c(y') \mid pu] = \sum_{i}^{n} a_{ij\ldots} Y_i'^2 + 2 \sum_{i}^{n} \sum_{j>i}^{n} b_{ij\ldots} Y_i' Y_j' + \sum_{i}^{n} a_{ij\ldots} \sigma_{2i}'^2$$

$$= v(\hat{Y}') + \sum_{i}^{n} a_{ij\ldots} \sigma_{2i}'^2$$

When the average is taken over all selections of the primary units, the
coefficient of $\sigma_{2i}'^2$ is $E(a_{ij\ldots})$. From (10.6), this average must be
$\pi_i(1 - \pi_i)$ if $v(\hat{Y}')$ is to be an unbiased estimate of $V(\hat{Y}')$. This gives

$$E[v_c(y')] = V(\hat{Y}') + \sum_{i}^{N} \pi_i (1 - \pi_i) \sigma_{2i}'^2$$

But from theorem 10.1

$$V(y') = V(\hat{Y}') + \sum_{i}^{N} \pi_i \sigma_{2i}'^2$$

Hence, to obtain an unbiased estimate of V(y'), we must add to $v_c(y')$
an unbiased estimate of $\sum^{N} \pi_i^2 \sigma_{2i}'^2$. Now

$$E\left(\sum_{i}^{n} \pi_i \hat{\sigma}_{2i}'^2\right) = \sum_{i}^{N} \pi_i^2 \sigma_{2i}'^2$$

This completes the proof.
This theorem gives the following working rule. To find an unbiased
sample estimate of V(y'), obtain from results in one-stage sampling an
unbiased estimate $v(\hat{Y}')$ of $V(\hat{Y}')$. Compute a copy of this, $v_c(y')$, by
replacing $Y_i'$ by $y_i'$ throughout. To this add the term $\sum^{n} \pi_i \hat{\sigma}_{2i}'^2$, where $\hat{\sigma}_{2i}'^2$
is an unbiased sample estimate of the within-unit variance of $y_i'$.

Theorems 10.1 and 10.2 are more general than needed in this chapter,
but they are given here because they are widely useful.


10.3 VARIANCE OF THE ESTIMATED MEAN IN 
TWO-STAGE SAMPLING 
The following notation is used:

$y_{ij}$ = value obtained for the jth element in the ith primary unit

$\bar{y}_i = \sum_{j=1}^{m} y_{ij}/m$ = sample mean per element in the ith primary unit

$\bar{\bar{y}} = \sum_{i=1}^{n} \bar{y}_i/n$ = over-all sample mean per element

$S_1^2 = \sum_{i=1}^{N} (\bar{Y}_i - \bar{\bar{Y}})^2/(N - 1)$ = variance among primary unit means

$S_2^2 = \sum_{i=1}^{N} \sum_{j=1}^{M} (y_{ij} - \bar{Y}_i)^2/[N(M - 1)]$ = variance among elements within
primary units


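Theorem 10.3. If the n units and the m subunits from each chosen unit
are selected by simple random sampling, $\bar{\bar{y}}$ is an unbiased estimate of
$\bar{\bar{Y}}$ with variance

$$V(\bar{\bar{y}}) = \left(\frac{N - n}{N}\right) \frac{S_1^2}{n} + \left(\frac{M - m}{M}\right) \frac{S_2^2}{mn} \qquad (10.8)$$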


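Proof. In the notation of theorem 10.1 take $y_i' = \bar{y}_i/n$. Then

$$y' = \sum_{i}^{n} \bar{y}_i/n = \bar{\bar{y}}, \qquad Y_i' = \bar{Y}_i/n, \qquad \hat{Y}' = \sum_{i}^{n} \bar{Y}_i/n, \qquad \pi_i = n/N$$

From theorem 10.1

$$E(\bar{\bar{y}}) = E(y') = \sum_{i}^{N} \pi_i Y_i' = \frac{1}{N} \sum_{i}^{N} \bar{Y}_i = \bar{\bar{Y}}$$

Now by theorem 2.2 (p. 22), for single-stage sampling, since $\hat{Y}'$ is the
mean of the n values of $\bar{Y}_i$,

$$V(\hat{Y}') = \frac{N - n}{Nn} \cdot \frac{\sum_{i}^{N} (\bar{Y}_i - \bar{\bar{Y}})^2}{N - 1} = \left(\frac{N - n}{N}\right) \frac{S_1^2}{n} \qquad (10.9)$$

By the same theorem, since m units are selected out of M from the ith
unit, the variance of $y_i' = \bar{y}_i/n$ as an estimate of $Y_i' = \bar{Y}_i/n$ is

$$\sigma_{2i}'^2 = \left(\frac{M - m}{M}\right) \frac{S_{2i}^2}{mn^2}$$

where $S_{2i}^2$ is the variance among subunits in the ith primary unit. Hence,
by theorem 10.1,

$$V(\bar{\bar{y}}) = V(\hat{Y}') + \sum_{i}^{N} \pi_i \sigma_{2i}'^2 = \left(\frac{N - n}{N}\right) \frac{S_1^2}{n} + \frac{n}{N} \left(\frac{M - m}{M}\right) \frac{1}{mn^2} \sum_{i}^{N} S_{2i}^2$$

But $S_2^2 = \sum^{N} S_{2i}^2/N$. This gives

$$V(\bar{\bar{y}}) = \left(\frac{N - n}{N}\right) \frac{S_1^2}{n} + \left(\frac{M - m}{M}\right) \frac{S_2^2}{mn}$$

If $f_1 = n/N$ and $f_2 = m/M$ are sampling fractions in the first and second
stages, a form of the result that is easier to remember is

$$V(\bar{\bar{y}}) = \frac{1 - f_1}{n} S_1^2 + \frac{1 - f_2}{mn} S_2^2 \qquad (10.10)$$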
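
Because theorem 10.3 is an exact result, it can be checked by brute force on a very small population. The following sketch is not from the text: the function names and the toy data are invented, and it simply enumerates every equally likely two-stage sample (simple random sampling at both stages, independent subsampling) and compares the mean and variance of the sample mean with the formula.

```python
from itertools import combinations, product
from statistics import mean

def two_stage_exact(pop, n, m):
    """Mean and variance of the sample mean over all two-stage samples."""
    estimates = []
    for units in combinations(range(len(pop)), n):
        # every way of choosing m elements inside each selected unit
        choices = [list(combinations(pop[i], m)) for i in units]
        for subsamples in product(*choices):
            estimates.append(mean(mean(s) for s in subsamples))
    centre = mean(estimates)
    return centre, mean((e - centre) ** 2 for e in estimates)

def two_stage_formula(pop, n, m):
    """Population mean and the variance given by theorem 10.3."""
    N, M = len(pop), len(pop[0])
    unit_means = [mean(u) for u in pop]
    grand = mean(unit_means)
    S1_sq = sum((ym - grand) ** 2 for ym in unit_means) / (N - 1)
    S2_sq = sum((y - ym) ** 2
                for u, ym in zip(pop, unit_means) for y in u) / (N * (M - 1))
    f1, f2 = n / N, m / M
    return grand, (1 - f1) * S1_sq / n + (1 - f2) * S2_sq / (m * n)

toy = [[1, 2, 6], [3, 3, 9], [0, 4, 5], [7, 8, 8]]   # N = 4 units, M = 3
print(two_stage_exact(toy, n=2, m=2))
print(two_stage_formula(toy, n=2, m=2))
```

The two printed pairs should agree up to floating-point rounding; the enumeration is only practical for toy populations.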
n mn 


10.4 ESTIMATION OF THE VARIANCE 


If the n primary unit means $\bar{Y}_i$ were known, an unbiased estimate of
the variance of their mean $\hat{Y}'$ would be

$$v(\hat{Y}') = \frac{1 - f_1}{n} \cdot \frac{\sum_{i}^{n} (\bar{Y}_i - \hat{Y}')^2}{n - 1}$$

The copy of $v(\hat{Y}')$ from a two-stage sample is

$$v_c(y') = v_c(\bar{\bar{y}}) = \frac{1 - f_1}{n} \cdot \frac{\sum_{i}^{n} (\bar{y}_i - \bar{\bar{y}})^2}{n - 1} \qquad (10.11)$$

For theorem 10.2, we require also an unbiased estimate of $\sigma_{2i}'^2$ in
equation (10.1). Since subsamples are chosen by simple random sampling,
this is given by

$$\hat{\sigma}_{2i}'^2 = \left(\frac{M - m}{M}\right) \frac{s_{2i}^2}{mn^2} = \frac{(1 - f_2) s_{2i}^2}{mn^2} \qquad (10.12)$$

where

$$s_{2i}^2 = \frac{\sum_{j}^{m} (y_{ij} - \bar{y}_i)^2}{m - 1}$$
. m—1 


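Theorem 10.4. Under the conditions of theorem 10.3, an unbiased
estimate of $V(\bar{\bar{y}})$ is

$$v(\bar{\bar{y}}) = \frac{1 - f_1}{n} s_1^2 + \frac{f_1(1 - f_2)}{mn} s_2^2 \qquad (10.13)$$

where

$$s_1^2 = \frac{\sum_{i}^{n} (\bar{y}_i - \bar{\bar{y}})^2}{n - 1}, \qquad s_2^2 = \frac{\sum_{i}^{n} \sum_{j}^{m} (y_{ij} - \bar{y}_i)^2}{n(m - 1)} \qquad (10.14)$$

Proof. By theorem 10.2, an unbiased estimate of $V(\bar{\bar{y}})$ is

$$v(\bar{\bar{y}}) = v_c(\bar{\bar{y}}) + \sum_{i}^{n} \pi_i \hat{\sigma}_{2i}'^2$$

Using (10.11), (10.12), and $\pi_i = n/N$, this gives

$$v(\bar{\bar{y}}) = \frac{1 - f_1}{n} s_1^2 + \frac{f_1(1 - f_2)}{mn^2} \sum_{i}^{n} s_{2i}^2$$

But $s_2^2$, as defined in (10.14), is $\sum^{n} s_{2i}^2/n$. Hence

$$v(\bar{\bar{y}}) = \frac{1 - f_1}{n} s_1^2 + \frac{f_1(1 - f_2)}{mn} s_2^2$$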
mn 


Corollary. A result that will be used later is

$$E(s_1^2) = S_1^2 + \frac{1 - f_2}{m} S_2^2 \qquad (10.15)$$

Proof. Since $v(\bar{\bar{y}})$ is an unbiased estimate of $V(\bar{\bar{y}})$ and $E(s_2^2) = S_2^2$, (10.13) gives

$$\frac{1 - f_1}{n} E(s_1^2) = \frac{1 - f_1}{n} S_1^2 + \frac{1 - f_2}{mn} S_2^2 - \frac{f_1(1 - f_2)}{mn} S_2^2$$

It follows that an unbiased estimate of $S_1^2$ is $[s_1^2 - s_2^2(1 - f_2)/m]$.
Notes on Theorem 10.4. If m = M, that is, $f_2 = 1$, formula 10.13
becomes that appropriate to simple random sampling of the units. If
n = N, the formula is that for proportional stratified random sampling,
since primary units may then be regarded as strata, all of which are
sampled. In this connection, two-stage sampling is a kind of incomplete
stratification, with the units as strata.

In the common situation in which $f_1 = n/N$ is negligible, we obtain
the useful result,

$$v(\bar{\bar{y}}) = \frac{s_1^2}{n} = \frac{\sum_{i}^{n} (\bar{y}_i - \bar{\bar{y}})^2}{n(n - 1)} \qquad (10.16)$$

Thus the estimated variance can be computed from a knowledge of the
unit means only. This result is particularly helpful when subsampling
is systematic, because in this event we cannot compute an unbiased
estimate of $S_2^2$. But (10.16) still applies, provided that n/N is small. If
n/N is not small, it is easily seen that (10.16) gives an overestimate.


10.5 THE ESTIMATION OF PROPORTIONS 


If the elements are classified into two classes and we estimate the
proportion that falls in the first class, the preceding formulas can be
applied by the usual device of defining $y_{ij}$ as 1 if the corresponding
element falls into this class and as zero otherwise. Let $p_i = a_i/m$ be the
proportion falling in the first class in the subsample from the ith unit.

The two estimated variances $s_1^2$ and $s_2^2$ required for theorem 10.4 work
out as follows:

$$s_1^2 = \frac{\sum_{i=1}^{n} (p_i - \bar{p})^2}{n - 1}, \qquad s_2^2 = \frac{m}{n(m - 1)} \sum_{i=1}^{n} p_i q_i$$

where $\bar{p} = \sum p_i/n$. Consequently, by theorem 10.4,

$$v(\bar{p}) = \frac{1 - f_1}{n(n - 1)} \sum_{i=1}^{n} (p_i - \bar{p})^2 + \frac{f_1(1 - f_2)}{n^2(m - 1)} \sum_{i=1}^{n} p_i q_i$$

Example. In a study of plant disease the plants were grown in 160 small
plots containing nine plants each. A random sample of 40 plots was chosen and
three random plants in each sampled plot were examined for the presence of
disease. It was found that 22 plots had no diseased plants (out of three), 11
had one, four had two, and three had three. Estimate the proportion of diseased
plants and its s.e. The symbol f denotes the frequencies 22, 11, 4, 3.

We have N = 160, M = 9, n = 40, m = 3. In finding $s_1^2$ and $s_2^2$, it is convenient
to work at first with the numbers of diseased plants ($3p_i$) and the numbers
of healthy plants ($3q_i$). The calculations are set out as follows:

        Frequency
$3p_i$      f      $9p_iq_i$   $9fp_iq_i$   $3fp_i$   $9fp_i^2$
0          22       0          0         0        0
1          11       2         22        11       11
2           4       2          8         8       16
3           3       0          0         9       27
Total      40                 30        28       54

$$\bar{p} = \frac{\sum f p_i}{n} = \frac{28}{(3)(40)} = 0.233, \qquad \sum f p_i q_i = \frac{30}{9} = 3.333$$

$$\sum f (p_i - \bar{p})^2 = \frac{1}{9}\left[54 - \frac{(28)^2}{40}\right] = 3.822$$

Hence, from the formula immediately before this example,

$$v(\bar{p}) = \frac{(3)(3.822)}{(4)(40)(39)} + \frac{(2)(3.333)}{(4)(3)(1600)(2)} = 0.00201$$

The proportion diseased is 0.233 with s.e. 0.045. The approximate formula
$s_1/\sqrt{n}$, from (10.16), gives 0.049, a reasonably good estimate considering that
n/N = 1/4.
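
The arithmetic of this example can be retraced in a few lines. The sketch below is only an illustration (variable names are invented); it applies the formula for the estimated variance of the proportion given just before the example, and the approximation (10.16), to the frequencies 22, 11, 4, 3.

```python
from math import sqrt

N, M, n, m = 160, 9, 40, 3
# freq[d] = number of sampled plots with d diseased plants out of m
freq = {0: 22, 1: 11, 2: 4, 3: 3}

f1, f2 = n / N, m / M
p_bar = sum(f * d / m for d, f in freq.items()) / n
ss_between = sum(f * (d / m - p_bar) ** 2 for d, f in freq.items())
sum_pq = sum(f * (d / m) * (1 - d / m) for d, f in freq.items())

# estimated variance of the proportion (theorem 10.4 applied to 0/1 data)
v = ((1 - f1) / (n * (n - 1)) * ss_between
     + f1 * (1 - f2) / (n ** 2 * (m - 1)) * sum_pq)
se = sqrt(v)
se_approx = sqrt(ss_between / (n * (n - 1)))   # (10.16), ignores n/N

print(round(p_bar, 3), round(v, 5), round(se, 3), round(se_approx, 3))
```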


10.6 OPTIMUM SAMPLING AND SUBSAMPLING 
FRACTIONS 


These depend on the type of cost function. If travel costs between
units are unimportant, one form that has proved useful is

$$C = c_1 n + c_2 nm$$

The first component of cost, $c_1 n$, is proportional to the number of
primary units in the sample; the second, $c_2 nm$, to the total number of
second-stage units or elements. From theorem 10.3, $V(\bar{\bar{y}})$ may be written

$$V(\bar{\bar{y}}) = \frac{1}{n}\left(S_1^2 - \frac{S_2^2}{M}\right) + \frac{1}{mn} S_2^2 - \frac{1}{N} S_1^2 \qquad (10.17)$$

The last term on the right does not depend on the choice of n and m.
Minimizing V for fixed C, or C for fixed V, is equivalent to minimizing
the product

$$\left(V + \frac{1}{N} S_1^2\right) C = c_1\left(S_1^2 - \frac{S_2^2}{M}\right) + c_2 S_2^2 + m c_2\left(S_1^2 - \frac{S_2^2}{M}\right) + \frac{c_1 S_2^2}{m}$$

Note that the first two terms are constant, whereas the last two depend
on m but not on n. The minimizing value of m may be found by
differentiation. But since, in practice, m must be an integer and is often small,
a more accurate approach suggested by Eisenhart (Cameron, 1951) is
used. Write

$$a = c_1 S_2^2, \qquad b = c_2\left(S_1^2 - \frac{S_2^2}{M}\right)$$

We wish to find an integer m such that

$$\frac{a}{m} + bm \le \frac{a}{m + 1} + b(m + 1), \quad \text{that is,} \quad m(m + 1) \ge \frac{a}{b}$$

$$\frac{a}{m} + bm \le \frac{a}{m - 1} + b(m - 1), \quad \text{that is,} \quad m(m - 1) \le \frac{a}{b}$$

These relations give the following rule. Compute

$$m_{opt} = \sqrt{a/b} = \frac{S_2}{\sqrt{S_1^2 - S_2^2/M}} \sqrt{c_1/c_2} \qquad (10.18)$$

If $m_{opt}$ lies between the integers m, m + 1, choose (m + 1), that is, round
upward, if $m_{opt}^2 > m(m + 1)$; otherwise round downward. Thus, if
$m_{opt}$ lies between $1.414 = \sqrt{2}$ and 2, we round upward to 2. If $m_{opt}$ is
greater than M, or if $S_1^2$ is less than $S_2^2/M$, we take m = M and employ
one-stage sampling.
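
The rounding rule can be written as a short function. This is a sketch under the assumptions of this section (cost proportional to $c_1 n + c_2 nm$); the function names are invented for illustration, and `cost_variance_product` is included only so that the rule can be compared with a direct search over integer m.

```python
from math import sqrt, floor

def optimum_m(S1_sq, S2_sq, M, c1, c2):
    """Integer m chosen by the rule following (10.18)."""
    Su_sq = S1_sq - S2_sq / M
    if Su_sq <= 0:
        return M                      # one-stage sampling
    m_opt = sqrt(S2_sq / Su_sq) * sqrt(c1 / c2)
    if m_opt >= M:
        return M
    low = floor(m_opt)
    # round upward only if m_opt^2 exceeds low * (low + 1)
    return low + 1 if m_opt ** 2 > low * (low + 1) else low

def cost_variance_product(m, S1_sq, S2_sq, M, c1, c2):
    """The part of (V + S1^2/N) * C that depends on m."""
    Su_sq = S1_sq - S2_sq / M
    return (c1 + c2 * m) * (Su_sq + S2_sq / m)

# Hypothetical inputs: S2^2 = 1.69 Su^2 and c1 = 10 c2, with M large.
M = 1000
print(optimum_m(1 + 1.69 / M, 1.69, M, c1=10, c2=1))
```

With these inputs the unrounded optimum is about 4.1, and the rule rounds it downward because $4.1^2$ is less than (4)(5).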

The structure of (10.18) is as expected. If elements were assigned to
units at random, that is, if the primary unit were as efficient as the element,
the variance of a primary unit mean would be $S_2^2/M$, so that $(S_1^2 - S_2^2/M)$
would be zero. This gives $m_{opt}$ = infinity, that is, complete enumeration
of the primary units. Conversely, the greater the variance $S_1^2$ among
primary unit means relative to that within primary units, the smaller the
value of $m_{opt}$. The greater the cost $c_1$ of access to the unit relative to the
cost $c_2$ of obtaining data from any element in the unit, the higher the
optimum m.

The value of n is found by solving either the cost equation or the
variance equation, depending on which has been preassigned.

In most practical situations the optimum is relatively flat. An error of 
a few units in the choice of m produces only a small loss of precision, as 
the following example illustrates. 

In the terminology of the analysis of variance, the quantity $(S_1^2 - S_2^2/M)$
which appears in the denominator of $m_{opt}$ is known as the component of
variance between unit means. Write

$$S_u^2 = S_1^2 - \frac{S_2^2}{M} \qquad (10.19)$$

Example. Let

$$c_1 = 10c_2, \qquad S_2 = 1.3 S_u$$

then

$$m_{opt} = 1.3\sqrt{10} = 4.1$$

We will regard total cost as fixed and see how the variance of $\bar{\bar{y}}$ changes with m.
N is assumed large. From (10.17),

$$V(\bar{\bar{y}}) = \frac{S_u^2}{n} + \frac{S_2^2}{nm} = \frac{1}{n}\left(S_u^2 + \frac{S_2^2}{m}\right)$$

Eliminating n by means of the cost equation, $n = C/(c_1 + c_2 m)$, gives

$$V(\bar{\bar{y}}) = \frac{(c_1 + c_2 m)}{C}\left(S_u^2 + \frac{S_2^2}{m}\right) = \frac{c_2 S_u^2}{C}\,(10 + m)\left(1 + \frac{1.69}{m}\right)$$

Omitting the constant factor, the relative variance can be calculated for different
values of m. Table 10.1 shows these variances and the relative precisions (with
the maximum precision for m = 4 taken as the standard).


TABLE 10.1
RELATIVE VARIANCES AND PRECISIONS FOR DIFFERENT VALUES OF m

m =                 1      2      3      4      5      6      7      8      9     10
Rel. variance   29.59  22.14  20.32  19.92  20.07  20.51  21.10  21.80  22.56  23.38
Rel. precision   0.67   0.90   0.98   1.00   0.99   0.97   0.94   0.91   0.88   0.85

For any value of m between 2 and 9, the loss of precision relative to the
optimum is less than 12%.
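
The flatness of the optimum is easy to see numerically. As a small illustration (the function name and defaults are chosen here, not taken from the text), the relative variance in this example is $(10 + m)(1 + 1.69/m)$ apart from a constant factor:

```python
def relative_variance(m, cost_ratio=10.0, s2_over_su_sq=1.69):
    """(c1/c2 + m) * (1 + (S2^2/Su^2) / m): the variance up to a constant."""
    return (cost_ratio + m) * (1 + s2_over_su_sq / m)

variances = {m: relative_variance(m) for m in range(1, 11)}
best = min(variances.values())
for m, v in variances.items():
    # relative precision = smallest variance / variance at this m
    print(m, round(v, 2), round(best / v, 2))
```

The printed values follow Table 10.1 closely; an entry may differ by one unit in the last decimal place because of rounding.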

In practice, the choice of m requires estimates of $c_1/c_2$ and $S_2^2/S_u^2$ or
equivalently $S_2/S_u$. Because of the flatness of the optimum, these ratios
need not be obtained with high accuracy. If $c_1/c_2$ is known reasonably
well and a value of m, say $m_0$, has been selected, a useful table (Brooks,
1955) shows the range of values of $S_2^2/S_u^2$ within which this $m_0$ gives a
precision at least 90% of the optimum.

The table was obtained as follows. For given cost, assuming N large,
the relative precision of $m_0$ to $m_{opt}$ is found to be

$$\frac{V(\bar{\bar{y}} \mid m_{opt})}{V(\bar{\bar{y}} \mid m_0)} = \frac{(S_u \sqrt{c_1} + S_2 \sqrt{c_2})^2}{S_u^2 c_1 + S_2^2 c_2 + m_0 S_u^2 c_2 + S_2^2 c_1/m_0} \qquad (10.20)$$

The set of values of $S_2/S_u$, for which this expression exceeds some assigned
level L are those lying between the two roots

$$\frac{S_2}{S_u} = \frac{\gamma \pm \sqrt{L(1 - L)}\,(\sqrt{m_0} + \gamma^2/\sqrt{m_0})}{(L\gamma^2/m_0) - (1 - L)} \qquad (10.21)$$

where $\gamma = \sqrt{c_1/c_2}$.
Table 10.2, adapted from Brooks (1955), shows the lower and upper
limits of $S_2^2/S_u^2$ for L = 0.9. The wide interval between the lower and
upper limits is striking in nearly all cases. Note that the range of $m_0$
changes in different parts of the table.


TABLE 10.2
LIMITS FOR $S_2^2/S_u^2$ WITHIN WHICH $m_0$ GIVES AT LEAST 90% OF THE MAXIMUM
PRECISION
O/C. = i 1 cilc = 2 4 

mo L U D U mo n U I U 

D zi Cd 
1 00 11 00 4 2 OS tee e 20230054 
2 2.0 98 elie 22) 3 1:229 2]T4|- (O85 8 
3 Ce 2d Gp) 4 20R TAA |. 11.0, 6 916. 
4 66 > 408 > 5 333) 382-| 1.6. 27, 
5 95; > 59 > 6 4T. c424 .42 
6 13 > 8.1 > 7 6.3 2-310313 2 Ol 
7 16e S 11 > 8 8098 7| 543: 187 
8 20 > 13 > 9 10 > 5.4 > 
— 

"E 


* > denotes “> 100." 


If we have a rough idea about the values of $S_2^2/S_u^2$ for the principal
items in a survey, Table 10.2 may be used to select a value $m_0$. Note that
if ρ is the correlation between elements in the same primary unit, as
defined in section 9.4, the ratio $S_2^2/S_u^2$ is nearly equal to (1 − ρ)/ρ. A
value of $S_2^2/S_u^2$ as low as 1 corresponds to ρ = 0.5. This would be an
unusually high degree of intraunit correlation. Similarly, ρ = 0.1 gives
$S_2^2/S_u^2 = 9$, whereas ρ = 0.01 gives $S_2^2/S_u^2 = 99$.

Example. Suppose that $c_1/c_2$ is about 1 and that $S_2^2/S_u^2$ is expected to lie
between 5 and 100 for the principal items. The columns $c_1/c_2 = 1$ give $m_0 = 4$ as a
satisfactory choice, since this covers ratios from 4 to more than 100 (actually to
196). With $c_1/c_2 = 16$ and the same desired range, the table suggests a value of
$m_0$ somewhere between 15 and 20. Further calculation from (10.21) shows that
$m_0 = 18$ is best. This covers the range from 5.2 to 84, not quite so wide as
desired.
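
The limits quoted in this example can be recomputed from (10.21). The sketch below is illustrative only: the function name is invented, γ is taken as $\sqrt{c_1/c_2}$, and when the denominator of (10.21) is not positive the upper limit is treated as infinite (an assumption made here about how such cases should be read).

```python
from math import sqrt, inf

def ratio_limits(cost_ratio, m0, L=0.9):
    """Lower and upper limits of S2^2/Su^2 within which m0 has precision >= L."""
    g = sqrt(cost_ratio)                          # gamma = sqrt(c1/c2)
    spread = sqrt(L * (1 - L)) * (sqrt(m0) + g * g / sqrt(m0))
    denom = L * g * g / m0 - (1 - L)
    if denom > 0:
        lower, upper = (g - spread) / denom, (g + spread) / denom
    elif denom < 0:
        lower, upper = (g - spread) / denom, inf  # no finite upper limit
    else:
        lower, upper = (L * m0 - (1 - L) * g * g) / (2 * g), inf
    lower = max(lower, 0.0)
    return lower ** 2, upper ** 2                 # limits for the squared ratio

print(ratio_limits(1, 4))     # the case c1/c2 = 1, m0 = 4
print(ratio_limits(16, 18))   # the case c1/c2 = 16, m0 = 18
```

The two calls correspond to the ranges "4 to 196" and "5.2 to 84" quoted in the example.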


When the cost of travel between primary units is substantial, a more
accurate cost function may be

$$C = c_1 n + c_t \sqrt{n} + c_2 nm \qquad (10.22)$$

since travel costs tend to be proportional to $\sqrt{n}$. If a desired value of
$V(\bar{\bar{y}})$ has been specified, pairs of values of (n, m) that give this variance
are easily computed from (10.17) for $V(\bar{\bar{y}})$. The costs for different
combinations are then computed from (10.22) and the combination giving
the smallest cost is found. When cost is fixed in advance, Hansen, Hurwitz,
and Madow (1953) give a method for determining the (n, m) combination
that minimizes the variance and a table that facilitates a rapid choice.
Note that their n is our m and vice versa.
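
The search over (n, m) pairs for a specified variance is simple to automate. The following sketch assumes N large, so that (10.17) reduces to $V = (S_u^2 + S_2^2/m)/n$; the function name and all numerical inputs are hypothetical.

```python
from math import ceil, sqrt

def cheapest_design(V, Su_sq, S2_sq, c1, ct, c2, max_m):
    """Return (n, m, cost) with the smallest cost (10.22) meeting variance V."""
    best = None
    for m in range(1, max_m + 1):
        # smallest n whose variance does not exceed the target
        n = ceil((Su_sq + S2_sq / m) / V)
        cost = c1 * n + ct * sqrt(n) + c2 * n * m
        if best is None or cost < best[2]:
            best = (n, m, cost)
    return best

# Hypothetical values: Su^2 = 1, S2^2 = 1.69, c1 = 10, c2 = 1, travel term ct = 40.
print(cheapest_design(V=0.01, Su_sq=1.0, S2_sq=1.69, c1=10, ct=40, c2=1, max_m=20))
```

Setting `ct=0` gives the search for the simpler cost function $c_1 n + c_2 nm$, which allows a comparison of the two cost models.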


10.7 ESTIMATION OF $m_{opt}$ FROM A PILOT SURVEY


Sometimes estimates of $S_2^2$ and $S_u^2$ or $S_1^2$ are obtained from a pilot
survey in which n' primary units are chosen, with m' elements taken from
each unit. This section deals with the choice of n' and m'. If $s_1^2$ is the
variance between unit means and $s_2^2$ is the variance between elements
within units, as defined in section 10.4, (10.15) gives

$$E(s_1^2) = \left(S_1^2 - \frac{S_2^2}{M}\right) + \frac{S_2^2}{m'} = S_u^2 + \frac{S_2^2}{m'} \qquad (10.23)$$

For the simple cost function $c_1 n + c_2 nm$, we had

$$m_{opt} = \frac{S_2}{\sqrt{S_1^2 - S_2^2/M}} \sqrt{c_1/c_2}$$

As an estimate of $m_{opt}$ from the pilot survey, (10.23) suggests that we take

$$\hat{m}_{opt} = \frac{s_2}{\sqrt{s_1^2 - s_2^2/m'}} \sqrt{c_1/c_2} = \frac{\sqrt{m' c_1/c_2}}{\sqrt{(m' s_1^2/s_2^2) - 1}} \qquad (10.24)$$
The estimate $\hat{m}_{opt}$ is subject to a sampling error that depends on the
sampling error of the ratio $s_1^2/s_2^2$. From the analysis of variance it is
known that $m' s_1^2/s_2^2$ is distributed as

$$F\left(1 + \frac{m' S_u^2}{S_2^2}\right)$$

where F has (n' − 1) and n'(m' − 1) degrees of freedom, provided that
the $y_{ij}$ are normally distributed. This result leads to the sampling
distribution of $\hat{m}_{opt}$ for given values of n' and m', that is,

$$\hat{m}_{opt} = \frac{\sqrt{m' c_1/c_2}}{\sqrt{F\left(1 + \dfrac{m' S_u^2}{S_2^2}\right) - 1}} \qquad (10.25)$$

Example. For the example in section 10.6, in which

$$c_1 = 10c_2, \qquad S_2 = 1.3 S_u, \qquad m_{opt} = 1.3\sqrt{10} = 4.1$$

consider how well $m_{opt}$ is estimated from a pilot sample with n' = 10 and
m' = 4. From (10.25),

$$\hat{m}_{opt} = \frac{6.324}{\sqrt{F(1 + 4/1.69) - 1}} = \frac{6.324}{\sqrt{3.367F - 1}}$$

where F has 9 and 30 df. To find the limits within which $\hat{m}_{opt}$ will lie 80% of the
time, we have, from the 10% one-tailed significance levels of F,

$$F_{.10}(9, 30) = 1.8490, \qquad F_{.90}(9, 30) = 1/F_{.10}(30, 9) = 1/2.2547 = 0.4435$$

Substitution of these values of F gives

lower limit, $\hat{m}_{opt}$ = 2.8; upper limit, $\hat{m}_{opt}$ = 9.0

As shown previously in Table 10.1, any m in this range gives a degree of precision
close to the optimum. Thus, with n' = 10, m' = 4, the chances are 8 in 10 that
the loss of precision is small.

The 80 and 95% limits for n' = 5, 10, 20 and m' = 4 appear in Table 10.3.

TABLE 10.3
LOWER AND UPPER LIMITS FOR $\hat{m}_{opt}$

n'        80%          95%
5       2.5, ∞       1.8, ∞
10      2.8, 9.0     2.3, ∞
20      3.1, 6.4     2.7, 9.1


With n' = 20, we are almost certain to estimate $m_{opt}$ with precision close to the
optimum. This is not so with n' = 5.
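
The 80% limits for n' = 10 follow directly from (10.25) once the F percentage points are known. In the sketch below the function name is invented and the two F values are simply those quoted in the example, typed in by hand rather than computed; an infinite limit is returned when the quantity under the square root is not positive.

```python
from math import sqrt, inf

def m_hat(F, m_pilot, cost_ratio, su_over_s2_sq):
    """Value of the estimated optimum (10.25) for a given value of F."""
    d = F * (1 + m_pilot * su_over_s2_sq) - 1
    return inf if d <= 0 else sqrt(m_pilot * cost_ratio) / sqrt(d)

# 10% points quoted in the example for 9 and 30 df.
F_upper, F_lower = 1.8490, 1 / 2.2547
args = dict(m_pilot=4, cost_ratio=10, su_over_s2_sq=1 / 1.69)

print(round(m_hat(F_upper, **args), 1), round(m_hat(F_lower, **args), 1))
```

Other rows of Table 10.3 would need the corresponding F percentage points, which are not given in the text.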


If the ratio $c_1/c_2$ is the same in the pilot survey as in the main survey,
the cost of the pilot survey will be proportional to $c_1 n' + c_2 n'm'$. Brooks
(1955) gives a table of the values of (n', m') in the most economical pilot
survey that provides an expected relative precision of 90% in the estimation
of $m_{opt}$. Table 10.4 shows part of this table.

TABLE 10.4
PILOT SAMPLE DESIGNS HAVING AN EXPECTED RELATIVE PRECISION OF 90%


The computations assume that N and M are large: the designs are
conservative if fpc terms are taken into account. Note that no more than
10 primary units are required and that the designs are relatively insensitive
to the ratio $c_1/c_2$.


10.8 THREE-STAGE SAMPLING 


The process of subsampling can be carried to a third stage by sampling 
the subunits (elements) instead of enumerating them completely. For 
instance, in surveys to estimate crop production in India (Sukhatme, 
1947), the village is a convenient sampling unit. Within a village, only 
some of the fields growing the crop in question are selected, so that the 
field is a subunit. When a field is selected, only certain parts of it are cut 
for the determination of yield per acre: thus the subunit itself is sampled. 
If physical or chemical analyses of the crop are involved, an additional 
subsampling may be used, since these determinations are often made on 
a part of the sample cut from a field. 

The results are a straightforward extension of those for two-stage
sampling and are given briefly. The population contains N first-stage
units, each with M second-stage units, each of which has K third-stage
units. The corresponding numbers for the sample are n, m, and k,
respectively. Let $y_{iju}$ be the value obtained for the uth third-stage unit in
the jth second-stage unit drawn from the ith primary unit. The relevant
population means per third-stage unit are as follows:

$$\bar{Y}_{ij} = \frac{\sum_{u}^{K} y_{iju}}{K}, \qquad \bar{\bar{Y}}_i = \frac{\sum_{j}^{M} \sum_{u}^{K} y_{iju}}{MK}, \qquad \bar{\bar{\bar{Y}}} = \frac{\sum_{i}^{N} \sum_{j}^{M} \sum_{u}^{K} y_{iju}}{NMK}$$

The following population variances are required:

$$S_1^2 = \frac{\sum_{i}^{N} (\bar{\bar{Y}}_i - \bar{\bar{\bar{Y}}})^2}{N - 1}$$

$$S_2^2 = \frac{\sum_{i}^{N} \sum_{j}^{M} (\bar{Y}_{ij} - \bar{\bar{Y}}_i)^2}{N(M - 1)}$$

$$S_3^2 = \frac{\sum_{i}^{N} \sum_{j}^{M} \sum_{u}^{K} (y_{iju} - \bar{Y}_{ij})^2}{NM(K - 1)}$$
If simple random sampling is used at all three stages, 


the sample mean 5j per third-stage unit is an unbiased estimate of Y with 
variance 


2 
3 


Theorem 10.5. 


V@) = USA S24 ahs S24 1 = Sag (10.26) 
n nm nmk 


where f, = n/N, f, 
three stages. 


Proof. Only the Principal steps are indicated, Write 


= m[M, f, — k/K are the sampling fractions at the 


)-Y-6-YZ2)4(.- F) 4 (F,— Y) 
where Yon is the population mean of t 


he nm second-stage units that were 
selected and Y, is the 


Population mean of the n primary units that were 
uare and take the average, the cross-product terms 
ions of the Squared terms turn out to be as follows: 


EG — Yan)? =h g 2 


n 
When these three terms are added, the theorem is obtained. 


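Theorem 10.6. An unbiased estimate of $V(\bar{\bar{\bar{y}}})$ from the sample is

$$v(\bar{\bar{\bar{y}}}) = \frac{1-f_1}{n} s_1^2 + \frac{f_1(1-f_2)}{nm} s_2^2 + \frac{f_1 f_2(1-f_3)}{nmk} s_3^2 \tag{10.27}$$

where $s_1^2$, $s_2^2$, $s_3^2$ are the sample analogues of $S_1^2$, $S_2^2$, $S_3^2$, respectively.

Proof. This may be proved by the methods in section 10.4 or alternatively
by showing that

$$E(s_1^2) = S_1^2 + \frac{1-f_2}{m} S_2^2 + \frac{1-f_3}{mk} S_3^2 \tag{10.28}$$

$$E(s_2^2) = S_2^2 + \frac{1-f_3}{k} S_3^2$$

and $E(s_3^2) = S_3^2$. To obtain the first result, let $\bar{\bar{y}}_{iK}$ denote the mean over
the m second-stage units in the ith primary unit, given that all K elements
were enumerated at the third stage. Let $\bar{\bar{\bar{y}}}_K$ be the mean of the n values
$\bar{\bar{y}}_{iK}$. Then from (10.15) for two-stage sampling, it follows that

$$E\left[\frac{\sum_{i=1}^{n} (\bar{\bar{y}}_{iK} - \bar{\bar{\bar{y}}}_K)^2}{n-1}\right] = S_1^2 + \frac{1-f_2}{m} S_2^2$$

Now, if $\bar{\bar{y}}_i$ is the sample mean for the ith primary unit, write

$$(\bar{\bar{y}}_i - \bar{\bar{\bar{y}}}) = (\bar{\bar{y}}_{iK} - \bar{\bar{\bar{y}}}_K) + [(\bar{\bar{y}}_i - \bar{\bar{y}}_{iK}) - (\bar{\bar{\bar{y}}} - \bar{\bar{\bar{y}}}_K)]$$

By first averaging over samples in which the first-stage and second-stage
units are fixed, it is easily shown that

$$E\left\{\frac{\sum_{i=1}^{n} [(\bar{\bar{y}}_i - \bar{\bar{y}}_{iK}) - (\bar{\bar{\bar{y}}} - \bar{\bar{\bar{y}}}_K)]^2}{n-1}\right\} = \frac{1-f_3}{mk} S_3^2$$

and that the sum of the cross-product terms contributes nothing. This
establishes the result for $E(s_1^2)$. That for $E(s_2^2)$ is found similarly. Hence

$$E[v(\bar{\bar{\bar{y}}})] = \frac{1-f_1}{n}\left(S_1^2 + \frac{1-f_2}{m} S_2^2 + \frac{1-f_3}{mk} S_3^2\right) + \frac{f_1(1-f_2)}{nm}\left(S_2^2 + \frac{1-f_3}{k} S_3^2\right) + \frac{f_1 f_2(1-f_3)}{nmk} S_3^2$$

$$= \frac{1-f_1}{n} S_1^2 + \frac{1-f_2}{nm} S_2^2 + \frac{1-f_3}{nmk} S_3^2 = V(\bar{\bar{\bar{y}}})$$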

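As with two-stage sampling, it is clear from (10.27) that if $f_1$ is negligible
$v(\bar{\bar{\bar{y}}})$ reduces to

$$v(\bar{\bar{\bar{y}}}) = \frac{s_1^2}{n} = \frac{\sum_{i=1}^{n} (\bar{\bar{y}}_i - \bar{\bar{\bar{y}}})^2}{n(n-1)} \tag{10.29}$$

This estimate is conservative if $f_1$ is not negligible.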
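
The three-stage formulas (10.26) and (10.27) can be checked numerically. The following is a minimal Python sketch; the function names and every input value are hypothetical illustrations, not taken from the text.

```python
def three_stage_variance(S1_sq, S2_sq, S3_sq, n, m, k, N, M, K):
    """True variance of the three-stage sample mean, as in (10.26)."""
    f1, f2, f3 = n / N, m / M, k / K  # sampling fractions at the three stages
    return ((1 - f1) / n * S1_sq
            + (1 - f2) / (n * m) * S2_sq
            + (1 - f3) / (n * m * k) * S3_sq)


def three_stage_variance_estimate(s1_sq, s2_sq, s3_sq, n, m, k, N, M, K):
    """Sample estimate of that variance, as in (10.27)."""
    f1, f2, f3 = n / N, m / M, k / K
    return ((1 - f1) / n * s1_sq
            + f1 * (1 - f2) / (n * m) * s2_sq
            + f1 * f2 * (1 - f3) / (n * m * k) * s3_sq)


# Made-up design: 10 of 100 primary units, 3 of 30 second-stage units,
# 2 of 20 third-stage units; the mean squares 4.0, 2.0, 1.0 are invented.
v_hat = three_stage_variance_estimate(4.0, 2.0, 1.0, n=10, m=3, k=2, N=100, M=30, K=20)
v_short = 4.0 / 10  # shortcut (10.29), which ignores the first-stage fpc
print(v_hat, v_short)
```

In this illustration the shortcut (10.29) exceeds the full estimate (10.27), in line with the remark that it is conservative when $f_1$ is not negligible.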
With a cost function of the form

$$C = c_1 n + c_2 nm + c_3 nmk$$

the optimum values of k and m are

$$k_{opt} = \frac{S_3}{\sqrt{S_2^2 - S_3^2/K}}\sqrt{\frac{c_2}{c_3}}, \qquad m_{opt} = \frac{\sqrt{S_2^2 - S_3^2/K}}{\sqrt{S_1^2 - S_2^2/M}}\sqrt{\frac{c_1}{c_2}} \tag{10.30}$$

The extension of the results in this section to additional stages of sampling
should be clear from the structure of the formulas.
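
As a rough illustration of (10.30), the sketch below evaluates the continuous optima for the cost function with unit costs $c_1$, $c_2$, $c_3$. It is written in Python with a hypothetical function name and invented inputs; in practice the results would still have to be rounded to integers and kept within M and K.

```python
import math


def optimum_m_and_k(S1_sq, S2_sq, S3_sq, c1, c2, c3, M, K):
    """Continuous optima of m and k from (10.30)."""
    a1 = S1_sq - S2_sq / M  # coefficient of 1/n in the variance
    a2 = S2_sq - S3_sq / K  # coefficient of 1/(nm) in the variance
    if a1 <= 0 or a2 <= 0:
        raise ValueError("(10.30) needs S1^2 > S2^2/M and S2^2 > S3^2/K")
    k_opt = math.sqrt(S3_sq / a2) * math.sqrt(c2 / c3)
    m_opt = math.sqrt(a2 / a1) * math.sqrt(c1 / c2)
    return m_opt, k_opt


# Invented variances and costs; M and K are taken very large so that the
# correction terms S2^2/M and S3^2/K are negligible.
m_opt, k_opt = optimum_m_and_k(S1_sq=4.0, S2_sq=9.0, S3_sq=16.0,
                               c1=16.0, c2=4.0, c3=1.0, M=1e9, K=1e9)
print(m_opt, k_opt)
```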


10.9 STRATIFIED SAMPLING OF THE UNITS 


Subsampling may be combined with any type of sampling of the primary 
units. The subsampling itself may employ stratification or systematic 
sampling. Variance formulas for these modifications can be built up 
from the formulas for the simpler methods.

Results are given for stratified sampling of the primary units in a
two-stage sample. The primary unit sizes are assumed constant for a given
stratum but may vary from stratum to stratum. This situation occurs
when primary units are stratified by size so that sizes within a stratum
become constant or nearly so.

The hth stratum contains $N_h$ primary units, each with $M_h$ second-stage
units; the corresponding sample numbers are $n_h$ and $m_h$. The estimated
population mean per second-stage unit is

$$\bar{\bar{y}}_{st} = \frac{\sum_h N_h M_h \bar{\bar{y}}_h}{\sum_h N_h M_h} = \sum_h W_h \bar{\bar{y}}_h \tag{10.31}$$

where $W_h = N_h M_h / \sum N_h M_h$ is the relative size of the stratum in terms of
second-stage units and $\bar{\bar{y}}_h$ is the sample mean in the stratum. By applying
theorem 10.3 within each stratum, we have

$$V(\bar{\bar{y}}_{st}) = \sum_h W_h^2 \left(\frac{1-f_{1h}}{n_h} S_{1h}^2 + \frac{1-f_{2h}}{n_h m_h} S_{2h}^2\right) \tag{10.32}$$

where $f_{1h} = n_h/N_h$, $f_{2h} = m_h/M_h$.
From theorem 10.4, an unbiased sample estimate is

$$v(\bar{\bar{y}}_{st}) = \sum_h W_h^2 \left[\frac{1-f_{1h}}{n_h} s_{1h}^2 + \frac{f_{1h}(1-f_{2h})}{n_h m_h} s_{2h}^2\right] \tag{10.33}$$

Corresponding variances for the estimated population total are obtained
by multiplying formulas (10.32) and (10.33) by $(\sum N_h M_h)^2$.


10.10 OPTIMUM ALLOCATION WITH STRATIFIED 
SAMPLING 
This deals with the best choice of the $n_h$ and the $m_h$. If travel costs
between units are not a major factor, the cost may be represented
adequately by the formula

$$C = \sum_h c_{1h} n_h + \sum_h c_{2h} n_h m_h \tag{10.34}$$

From (10.32), the variance may be rewritten as

$$V(\bar{\bar{y}}_{st}) = \sum_h W_h^2 \left[\frac{1}{n_h}\left(S_{1h}^2 - \frac{S_{2h}^2}{M_h}\right) + \frac{1}{n_h m_h} S_{2h}^2 - \frac{1}{N_h} S_{1h}^2\right]$$

The quantity

$$V(\bar{\bar{y}}_{st}) + \lambda\left(\sum_h c_{1h} n_h + \sum_h c_{2h} n_h m_h - C\right)$$

where $\lambda$ is a Lagrange multiplier, is a function of the variables $n_h$ and
$(n_h m_h)$. Hence, to minimize V for fixed C, or vice versa, we have

$$n_h \sqrt{\lambda} = \frac{W_h}{\sqrt{c_{1h}}}\sqrt{S_{1h}^2 - S_{2h}^2/M_h} \tag{10.35}$$

$$n_h m_h \sqrt{\lambda} = \frac{W_h S_{2h}}{\sqrt{c_{2h}}} \tag{10.36}$$

These give

$$m_h = \frac{S_{2h}}{\sqrt{S_{1h}^2 - S_{2h}^2/M_h}}\sqrt{\frac{c_{1h}}{c_{2h}}}$$

The formula for optimum $m_h$ is exactly the same as in unstratified sampling
[(10.18) in section 10.6].
From (10.35), since $W_h \propto N_h M_h$,

$$n_h \propto \frac{N_h M_h}{\sqrt{c_{1h}}}\sqrt{S_{1h}^2 - \frac{S_{2h}^2}{M_h}} = \frac{N_h M_h S_{uh}}{\sqrt{c_{1h}}} \tag{10.37}$$

It will be recalled that in one-stage stratified sampling (section 5.5),
the optimum $n_h$ is proportional to $N_h S_h/\sqrt{c_h}$, where $S_h$ is the standard
deviation among unit totals and $c_h$ is the cost per unit. Referring to
(10.37), the quantity $S_{uh}^2$ is the component of variance among primary
unit means, as explained in section 10.6. Hence $M_h S_{uh}$ in (10.37) may be
regarded as a kind of standard deviation among primary unit totals,
except that we are now dealing with the component of variance rather
than the total variance.

Since self-weighting estimates are convenient, we consider under what
circumstances the optimum allocation leads to a self-weighting estimate.
From (10.31), it follows that $\bar{\bar{y}}_{st}$ is self-weighting if $n_h m_h / N_h M_h = f_0 =$
constant, since in this event

$$\bar{\bar{y}}_{st} = \frac{\sum_h (N_h M_h / n_h m_h)\sum_i^{n_h}\sum_j^{m_h} y_{hij}}{\sum_h N_h M_h} = \frac{\sum_h\sum_i\sum_j y_{hij}}{f_0 \sum_h N_h M_h} = \frac{\sum_h\sum_i\sum_j y_{hij}}{\sum_h n_h m_h} = \bar{\bar{y}}$$

The condition is, as might be expected, that the over-all sampling fraction
$f_0$ be the same in all strata.
From (10.36), the optimum allocation gives

$$\frac{n_h m_h}{N_h M_h} \propto \frac{S_{2h}}{\sqrt{c_{2h}}}$$

Frequently $c_{2h}$, the cost per second-stage unit, will be approximately
the same in large and small primary units; but $S_{2h}$ may be greater in large
units than in small. However, since the optimum is flat, a self-weighting
sample will often be almost as precise as the optimum. Note that this
result holds even if the optimum sampling of primary units is far from
proportional.
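
The allocation rules of this section can be sketched in a few lines of Python. The function name, the dictionary keys, and the two strata below are hypothetical; the sketch returns, for each stratum, the continuous optimum $m_h$ and the share of primary units implied by the proportionality (10.37), under the linear cost function (10.34).

```python
import math


def optimum_stratified_allocation(strata):
    """For each stratum return (optimum m_h, share of the primary units)."""
    rows = []
    for s in strata:
        a = s["S1_sq"] - s["S2_sq"] / s["M"]  # S_1h^2 - S_2h^2 / M_h
        if a <= 0:
            raise ValueError("needs S_1h^2 > S_2h^2 / M_h in every stratum")
        m_opt = math.sqrt(s["S2_sq"] / a) * math.sqrt(s["c1"] / s["c2"])
        n_weight = s["N"] * s["M"] * math.sqrt(a / s["c1"])  # right side of (10.37)
        rows.append((m_opt, n_weight))
    total = sum(w for _, w in rows)
    return [(m, w / total) for m, w in rows]


# Two invented strata that differ only in the number of primary units.
strata = [
    {"N": 100, "M": 10, "S1_sq": 5.0, "S2_sq": 10.0, "c1": 9.0, "c2": 1.0},
    {"N": 300, "M": 10, "S1_sq": 5.0, "S2_sq": 10.0, "c1": 9.0, "c2": 1.0},
]
allocation = optimum_stratified_allocation(strata)
print(allocation)
```

With identical variances and costs, the shares are proportional to $N_h M_h$ and the same $m_h$ is used in both strata, which is the self-weighting case.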


EXERCISES 


10.1 A set of 20,000 records are stored in 400 file drawers, each containing
50 records. In a two-stage sample, five records are drawn at random from each of
80 randomly selected drawers. For one item, the estimates of variance were
$s_1^2 = 362$, $s_2^2 = 805$, as defined in section 10.4. (a) Compute the standard
error of the mean per record from this sample. (b) Compare this with the
standard error given by the approximate formula (10.16) in section 10.4.

10.2 From the results of a pilot two-stage sample, in which m' subunits were
chosen from each of n' units, it is useful to be able to estimate the value of
$V(\bar{\bar{y}})$ that would be given by a subsequent sample having m subunits from each
of n units. Show that an unbiased estimate of $V(\bar{\bar{y}})$ is

$$\hat{V}(\bar{\bar{y}}) = \left(\frac{N-n}{N}\right)\frac{s_1^2}{n} + s_2^2\left(\frac{1}{mn} - \frac{1}{m'n} + \frac{1}{m'N} - \frac{1}{MN}\right)$$

where $s_1^2$ and $s_2^2$ are computed from the preliminary sample. Hint. Use theorem
10.3 and the result

$$E(s_1^2) = S_1^2 - \frac{S_2^2}{M} + \frac{S_2^2}{m'}$$
10.3 In sampling wheat fields in Kansas, with the field as a primary unit,
King and McCarty (1941) report the following mean squares for yield in bushels
per acre: $s_1^2 = 165$, $s_2^2 = 66$. Two subsamples were taken per field. For a
sample of n fields, compare the variances of the sample mean as given by (a) the
sample as actually taken, (b) four subsamples per field from n fields, (c) completely
harvesting n fields.
N and M may be assumed large and constant. In (c) assume that complete
harvesting is equivalent to single-stage sampling (i.e., to having m = M).

10.4 In the same survey, with two subsamples per field, the mean squares
for the percentage of protein were $s_1^2 = 7.73$, $s_2^2 = 1.43$. How many fields are
required to estimate the mean yield to within ±1 bushel and the mean protein
percentage to within ±4, apart from a 1-in-20 chance in each case? Perform the
calculations (a) assuming that two subsamples per field are taken in the main
survey, (b) assuming complete harvesting of a field in the main survey.

10.5 For the wheat-yield data in exercise 10.3, what is the value of $c_1/c_2$ in a
linear cost function if the estimated optimum m is 2?

10.6 If m/M and n/N are both small and the cost function is linear, show that
m = 2 gives a smaller value of $V(\bar{\bar{y}})$ than m = 1 if


10.7 A large department store handles about 20,000 accounts receivable
per month. A 2% sample (m = 400) was verified each month over a two-year
period (n = 24). The numbers of accounts found to be in error per month (out
of 400) were (in order of magnitude) 0, 0, 1, 1, DE AVATS, 5955362 646 7:478; 
9, 9, 10, 10, 13, 14, 17, the time pattern being erratic. From the results in section
10.5, compute $s_1^2$ and $s_2^2$. Hence compute the standard error of the sample
proportion, as an estimate of the percentage of accounts that are in error over
a period of a year, that would be obtained from verifying (a) 1200 accounts
from a single month, chosen at random, (b) 360 accounts from each of four
random months, (c) 100 accounts each month. Hint. Either use the formula in
exercise 10.2 with m' = 400 or obtain unbiased estimates of $S_1^2$ and $S_2^2$ and
use theorem 10.3.

10.8 In planning a two-stage survey it was expected that $c_1/c_2$ would be about
4 and that $S_2^2/S_u^2$ would lie between 5 and 50. (a) What value of m would you
choose from Table 10.2? (b) Suppose that after the survey was completed it was
found that $c_1/c_2$ was close to 8 and $S_2^2/S_u^2$ was about 25. Compute the relative
precision given by your m to that given by the optimum m. (c) Make the same
computation for $c_1/c_2 = 4$, $S_2^2/S_u^2 = 100$.

10.9 If $\rho$ is the correlation coefficient between second-stage units in the same
primary unit, prove that

$$\rho = \frac{[(N-1)/N]\, S_1^2 - S_2^2/M}{[(NM-1)/NM]\, S^2}$$

(This establishes a result used in section 10.6.) 

10.10 Show that if $S_u^2 > 0$, in the notation of section 10.6, a simple random
sample of n primary units, with 1 element chosen per unit, is more precise than a
simple random sample of n elements (n > 1, M > 1). Show that the precision
of the two methods is equal if n/N is negligible. Would you expect this intuitively?


CHAPTER 11


Subsampling with Units 
of Unequal Sizes 


11.1 INTRODUCTION 


In sampling extensive populations, primary units that vary in size are
encountered frequently. Moreover, considerations of cost often dictate
the use of multistage sampling, so that the problems discussed in this
chapter are of common occurrence. If the sizes do not vary greatly, one
method is to stratify by size of primary unit, so that the units within a
stratum become equal, or nearly so. The formulas in section 10.9 may
then be an adequate approximation. Often, however, substantial
differences in size remain within some strata, and sometimes it is advisable to
base the stratification on other variables. In a review of the British Social
Surveys, which are nationwide samples with districts as primary units,
Gray and Corlett (1950) point out that size was at first included as one
of the variables for stratification but that another factor was found more
desirable when the characteristics of the population became better known.

Some concentrated effort is required in order to obtain a good working
knowledge of multistage sampling when the units vary in size, because
the technique is flexible. The units may be chosen either with equal
probabilities or with probabilities proportional to size or to some estimate
of size. Various rules can be devised to determine the sampling and
subsampling fractions, and various methods of estimation are available.
The advantages of the different methods depend on the nature of the
population, on the field costs, and on the supplementary data that are
at our disposal.

The first part of this chapter is devoted to a description of the principal
methods that are in use. We shall begin with a population that consists
of a single stratum. The extension to stratified sampling can be made,
as in preceding chapters, by summing the appropriate variance formulas
over the strata. For simplicity, we assume at first that only a single
primary unit is chosen, that is, that n = 1. This case is not so impractical
as it might appear at first sight, because when there is a large number of
strata we may achieve satisfactory precision in estimation even though
$n_h = 1$. In a series of monthly surveys taken by the U.S. Census Bureau
to estimate numbers of employed people, the primary unit is a county
or a group of neighboring counties. This is a large unit, but it has
administrative advantages that decrease costs. Since counties are far from uniform
in their characteristics, stratification is extended to the point at which
only one is selected from each stratum. Consequently, the theory to be
discussed is applicable to a single stratum in this sampling plan.

As in preceding chapters, the quantities to be estimated may be the
population total Y, the population mean (usually the mean per element
$\bar{\bar{Y}}$), or a ratio of two variates.

Notation. The observation for the jth element within the ith unit is
denoted by $y_{ij}$. The following symbols refer to the ith unit:

                        Population                 Sample
Number of elements      $M_i$                      $m_i$
Mean per element        $\bar{Y}_i$                $\bar{y}_i$
Total                   $Y_i = M_i\bar{Y}_i$       $y_i = m_i\bar{y}_i$

The following symbols refer to the whole population or sample:

                          Population                      Sample
Number of elements        $M_0 = \sum^N M_i$              $\sum^n m_i$
Total                     $Y = \sum^N Y_i$                $\sum^n y_i$
Mean per element          $\bar{\bar{Y}} = Y/M_0$         $\bar{\bar{y}} = \sum^n y_i / \sum^n m_i$
Mean per primary unit     $\bar{Y} = Y/N$                 $\bar{y} = \sum^n y_i / n$

11.2 SAMPLING METHODS WHEN n = 1


Suppose that the ith unit is selected and that it contains $M_i$ elements,
of which $m_i$ are sampled at random. We consider three methods of
estimating $\bar{\bar{Y}}$, the mean per element.

I. Units Chosen with Equal Probability

Estimate = $\bar{y}_I = \bar{y}_i$

The estimate is the sample mean per element. It is biased, for in
repeated sampling from the same unit the average of $\bar{y}_i$ is $\bar{Y}_i$ and, since
every unit has an equal chance of being selected, the average of $\bar{Y}_i$ is

$$\bar{Y}_u = \frac{1}{N}\sum_{i=1}^{N} \bar{Y}_i$$

But the population mean is

$$\bar{\bar{Y}} = \frac{\sum_{i=1}^{N} M_i \bar{Y}_i}{M_0}, \qquad \text{where } M_0 = \sum_{i=1}^{N} M_i$$

Hence the bias equals $(\bar{Y}_u - \bar{\bar{Y}})$. Since the method is biased, we shall
compute the mean square error (MSE) about $\bar{\bar{Y}}$. Write

$$\bar{y}_i - \bar{\bar{Y}} = (\bar{y}_i - \bar{Y}_i) + (\bar{Y}_i - \bar{Y}_u) + (\bar{Y}_u - \bar{\bar{Y}})$$

Square and take the expectation over all possible samples. All contributions
from cross-product terms vanish. The expectations of the squared
terms follow easily by the methods given in Chapter 10. We find

$$MSE(\bar{y}_I) = \underbrace{\frac{1}{N}\sum_{i=1}^{N}\frac{(M_i - m_i)}{M_i}\frac{S_{2i}^2}{m_i}}_{\text{within units}} + \underbrace{\frac{1}{N}\sum_{i=1}^{N}(\bar{Y}_i - \bar{Y}_u)^2}_{\text{between units}} + \underbrace{(\bar{Y}_u - \bar{\bar{Y}})^2}_{\text{bias}} \tag{11.1}$$

where

$$S_{2i}^2 = \frac{1}{M_i - 1}\sum_{j=1}^{M_i}(y_{ij} - \bar{Y}_i)^2$$

is the variance among elements in the ith unit.
The MSE of $\bar{y}_I$ contains three components: one arising from variation
within units, one from variation between the true means of the units, and
one from the bias.
The values of the $m_i$ have not been specified. The most common choice
is either to take all $m_i$ equal or to take $m_i$ proportional to $M_i$, that is, to
subsample a fixed proportion of whatever unit is selected. The choice
of the $m_i$ affects only the first of the three components of the variance, the
component that arises from variation within units.


II. Units Chosen with Equal Probability

Estimate = $\bar{y}_{II} = N M_i \bar{y}_i / M_0$

This estimate is unbiased. Since $\bar{y}_i$ is an unbiased estimate of $\bar{Y}_i$, the
product $M_i\bar{y}_i$ is an unbiased estimate of the unit total $Y_i$. Hence $N M_i \bar{y}_i$
is an unbiased estimate of the population total Y. Dividing by $M_0$, the
total number of elements in the population, we obtain an unbiased
estimate of $\bar{\bar{Y}}$.
To find $V(\bar{y}_{II})$, which, of course, equals its MSE, we have

$$\bar{y}_{II} - \bar{\bar{Y}} = \frac{N M_i \bar{y}_i}{M_0} - \bar{\bar{Y}} = \frac{N M_i}{M_0}(\bar{y}_i - \bar{Y}_i) + \left(\frac{N M_i \bar{Y}_i}{M_0} - \bar{\bar{Y}}\right)$$

Now $M_i\bar{Y}_i = Y_i$, the total for the unit, and $\bar{\bar{Y}} = N\bar{Y}/M_0$, where $\bar{Y}$ is the
population mean per unit. This gives

$$\bar{y}_{II} - \bar{\bar{Y}} = \frac{N M_i}{M_0}(\bar{y}_i - \bar{Y}_i) + \frac{N}{M_0}(Y_i - \bar{Y})$$

Hence

$$V(\bar{y}_{II}) = \frac{N}{M_0^2}\sum_{i=1}^{N} M_i(M_i - m_i)\frac{S_{2i}^2}{m_i} + \frac{N}{M_0^2}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 \tag{11.2}$$

The between-units component of this variance (second term on the
right) represents the variation among the unit totals $Y_i$. This component
is affected both by variations in the $M_i$ from unit to unit and by variations
in the means $\bar{Y}_i$ per element. If the units vary considerably in size, this
component is large even though the means per element $\bar{Y}_i$ are almost
constant from unit to unit. Frequently this component is so large that
$\bar{y}_{II}$ has a much higher MSE than the biased estimate $\bar{y}_I$. Thus neither
method I nor method II is fully satisfactory.


III. Units Chosen with Probability Proportional to Size

Estimate = $\bar{y}_{III} = \bar{y}_i$ = sample mean

This technique is due to Hansen and Hurwitz (1943). It gives a sample
mean that is unbiased and is not subject to the inflation of the variance
in method II.
In repeated sampling, the ith unit appears with relative frequency $M_i/M_0$.
Hence

$$E(\bar{y}_{III}) = \sum_{i=1}^{N}\frac{M_i}{M_0}\bar{Y}_i = \bar{\bar{Y}}$$

Further,

$$\bar{y}_{III} - \bar{\bar{Y}} = (\bar{y}_i - \bar{Y}_i) + (\bar{Y}_i - \bar{\bar{Y}})$$

Average first over samples in which the ith unit is selected.

$$E(\bar{y}_{III} - \bar{\bar{Y}})^2 = \frac{(M_i - m_i)}{M_i}\frac{S_{2i}^2}{m_i} + (\bar{Y}_i - \bar{\bar{Y}})^2$$

Now average over all possible selections of the unit. Since the ith unit is
selected with relative frequency $M_i/M_0$,

$$V(\bar{y}_{III}) = \frac{1}{M_0}\sum_{i=1}^{N}(M_i - m_i)\frac{S_{2i}^2}{m_i} + \frac{1}{M_0}\sum_{i=1}^{N} M_i(\bar{Y}_i - \bar{\bar{Y}})^2 \tag{11.3}$$

Note that, as in method I, the between-units component arises from
differences among the means per element $\bar{Y}_i$ in the units. If
these means per element are nearly equal, this component is small.


Example. Let us apply these results to a small population, artificially
constructed. The data are presented in Table 11.1. There are three units, with
2, 4, and 6 elements, respectively. The reader may verify the figures given for
$Y_i$, $S_{2i}^2$, and $\bar{Y}_i$. The population mean $\bar{\bar{Y}}$ is 33/12, or 2.75. The unweighted mean
of the $\bar{Y}_i$ is 2.167 = $\bar{Y}_u$, so that the bias in method I is −0.583. Its square, the
contribution to the MSE, is 0.340.


TABLE 11.1
ARTIFICIAL POPULATION WITH UNITS OF UNEQUAL SIZES

Unit      $y_{ij}$             $M_i$    $Y_i$    $S_{2i}^2$    $\bar{Y}_i$
1         0, 1                   2        1       0.500         0.5
2         1, 2, 2, 3             4        8       0.667         2.0
3         3, 3, 4, 4, 5, 5       6       24       0.800         4.0
Totals                          12       33


One unit is to be selected and two elements sampled from it. We consider four
methods, two of which are variants of method I.

Method Ia.
Selection: unit with equal probability, $m_i = 2$.
Estimate: $\bar{y}_i$ (biased).

Method Ib.
Selection: unit with equal probability, $m_i = \frac{1}{2}M_i$.
Estimate: $\bar{y}_i$ (biased).

Method II.
Selection: unit with equal probability, $m_i = 2$.
Estimate: $N M_i \bar{y}_i / M_0$ (unbiased).

Method III.
Selection: unit with probability $M_i/M_0$, $m_i = 2$.
Estimate: $\bar{y}_i$ (unbiased).

Method Ib (proportional subsampling) does not guarantee a sample size of 2
(it may be 1, 2, or 3), but the average sample size is 2.
By application of the sampling error formulas (11.1), (11.2), and (11.3), we
obtain the results in Table 11.2.


TABLE 11.2
MSE's OF SAMPLE ESTIMATES OF $\bar{\bar{Y}}$

               Contribution to MSE from
Method    Within Units    Between Units    Bias      Total MSE
Ia           0.145           2.056         0.340       2.541
Ib           0.183           2.056         0.340       2.579
II           0.256           5.792         0.000       6.048
III          0.189           1.813         0.000       2.002

Although the example is artificial, the results are typical of those found
in comparisons made on many populations. Method III gives the smallest
MSE because it has the smallest contribution from variation between
units. Method II, although unbiased, is very inferior. Method Ia (equal
size of subsample) is slightly better than method Ib (proportional
subsampling).
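
The entries of Table 11.2 can be reproduced directly from the population in Table 11.1. The sketch below does so in Python; the variable and function names are illustrative choices, and the three functions are plain transcriptions of formulas (11.1), (11.2), and (11.3).

```python
units = [[0, 1], [1, 2, 2, 3], [3, 3, 4, 4, 5, 5]]  # the y_ij of Table 11.1
N = len(units)
M = [len(u) for u in units]          # unit sizes M_i
M0 = sum(M)                          # total number of elements
Y = [sum(u) for u in units]          # unit totals Y_i
Ybar = [Y[i] / M[i] for i in range(N)]  # means per element in each unit
S2 = [sum((y - Ybar[i]) ** 2 for y in units[i]) / (M[i] - 1) for i in range(N)]
pop_mean = sum(Y) / M0               # population mean per element
unit_mean = sum(Y) / N               # population mean per primary unit
Ybar_u = sum(Ybar) / N               # unweighted mean of the unit means


def mse_method_I(m):
    """Components (within, between, bias squared) of (11.1)."""
    within = sum((M[i] - m[i]) / M[i] * S2[i] / m[i] for i in range(N)) / N
    between = sum((Ybar[i] - Ybar_u) ** 2 for i in range(N)) / N
    return within, between, (Ybar_u - pop_mean) ** 2


def mse_method_II(m):
    """Components of (11.2); the estimate is unbiased."""
    within = N / M0**2 * sum(M[i] * (M[i] - m[i]) * S2[i] / m[i] for i in range(N))
    between = N / M0**2 * sum((Y[i] - unit_mean) ** 2 for i in range(N))
    return within, between, 0.0


def mse_method_III(m):
    """Components of (11.3); the estimate is unbiased."""
    within = sum((M[i] - m[i]) * S2[i] / m[i] for i in range(N)) / M0
    between = sum(M[i] * (Ybar[i] - pop_mean) ** 2 for i in range(N)) / M0
    return within, between, 0.0


results = {
    "Ia": mse_method_I([2, 2, 2]),
    "Ib": mse_method_I([Mi // 2 for Mi in M]),  # m_i = M_i / 2
    "II": mse_method_II([2, 2, 2]),
    "III": mse_method_III([2, 2, 2]),
}
for name, parts in results.items():
    print(name, [round(p, 3) for p in parts], round(sum(parts), 3))
```

Small differences in the third decimal place relative to the table come from rounding of the individual components.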

Some comparisons of these methods have also been made on actual 
populations. For six items (total workers, total agricultural workers, 
total nonagricultural workers, estimated separately for males and females), 
Hansen and Hurwitz (1943) found that method III produced large reduc- 
tions in the contribution from variation between units as compared with 
the unbiased method II, and reductions which averaged 30% as compared
with method I. (They assumed the contribution from variation within 
units to be negligible.) In estimating typical farm items for the state of 
North Carolina, Jebe (1952) reported reductions in the total variance of 
the order of 15% as compared with methods of type I. In both studies 
the primary unit was a county. 


11.3 SAMPLING WITH PROBABILITY PROPORTIONAL 
TO ESTIMATED SIZE 


As mentioned in Chapter 9, the sizes $M_i$ of the units are sometimes
known only approximately from previous data, and in other surveys
several possible measures of the size of a unit may be available. Let $z_i$ be
the probability or relative size assigned to the ith unit, where the $z_i$ are any
set of positive numbers that add to unity. We still assume n = 1.

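Method IV. An unbiased estimate of $\bar{\bar{Y}}$ is

$$\bar{y}_{IV} = \frac{M_i \bar{y}_i}{z_i M_0} \tag{11.4}$$

This follows because, in repeated sampling, the ith unit appears with
relative frequency $z_i$, so that

$$E(\bar{y}_{IV}) = \sum_{i=1}^{N} z_i \frac{M_i \bar{Y}_i}{z_i M_0} = \sum_{i=1}^{N}\frac{M_i \bar{Y}_i}{M_0} = \bar{\bar{Y}}$$

The variance of $\bar{y}_{IV}$ is obtained in the usual way. Write

$$\bar{y}_{IV} - \bar{\bar{Y}} = \frac{M_i \bar{y}_i}{z_i M_0} - \bar{\bar{Y}} \quad [\text{by (11.4)}]$$

$$= \frac{1}{M_0}\left[\frac{M_i}{z_i}(\bar{y}_i - \bar{Y}_i) + \left(\frac{Y_i}{z_i} - Y\right)\right]$$

In the variance, each square receives a weight $z_i$. Hence

$$V(\bar{y}_{IV}) = \frac{1}{M_0^2}\left[\sum_{i=1}^{N}\frac{M_i(M_i - m_i)}{z_i m_i} S_{2i}^2 + \sum_{i=1}^{N} z_i\left(\frac{Y_i}{z_i} - Y\right)^2\right] \tag{11.5}$$

If $z_i = M_i/M_0$, (11.5) reduces to (11.3) for $V(\bar{y}_{III})$. If $z_i = 1/N$ (initial
probabilities equal), (11.5) reduces to (11.2) for the variance of the unbiased
estimate when probabilities are equal.

Unless $z_i = M_i/M_0$, the between-units component in (11.5) is affected
to some extent by variations in the sizes $M_i$ as well as by variations in the
means per element $\bar{Y}_i$.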


Example. Table 11.3 shows the computations for finding $V(\bar{y}_{IV})$ in the
artificial population in Table 11.1. The $z_i$ have been taken as 0.2, 0.4, and 0.4,
and the $m_i = 2$.


TABLE 11.3
COMPUTATION OF $V(\bar{y}_{IV})$

Unit   $M_i$   $M_i/M_0$   $z_i$   $m_i$   $M_i(M_i - m_i)/z_i m_i$   $S_{2i}^2$   $Y_i$   $Y_i/z_i$   $Y_i/z_i - Y$
1        2       0.17       0.2      2              0                   0.500        1        5          −28
2        4       0.33       0.4      2             10                   0.667        8       20          −13
3        6       0.50       0.4      2             30                   0.800       24       60          +27


From (11.5), the variance comes out as follows:

$$\text{within-units contribution} = \frac{1}{M_0^2}\sum\frac{M_i(M_i - m_i)}{z_i m_i} S_{2i}^2 = 0.213$$

$$\text{between-units contribution} = \frac{1}{M_0^2}\sum z_i\left(\frac{Y_i}{z_i} - Y\right)^2 = 3.583$$


Comparison with Table 11.2 shows that method IV has a lower variance
than the unbiased method II in which the primary unit is chosen with
equal probabilities, but method IV is decidedly inferior to method I or
method III. In this example method IV pays too high a price in order to
obtain an unbiased estimate.
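
The two contributions quoted for method IV follow from (11.5) and the columns of Table 11.3. A short Python check, with variable names chosen only for this illustration:

```python
M = [2, 4, 6]            # unit sizes M_i from Table 11.1
Y = [1, 8, 24]           # unit totals Y_i
S2 = [0.5, 2 / 3, 0.8]   # within-unit variances S_2i^2
z = [0.2, 0.4, 0.4]      # assigned selection probabilities z_i
m = [2, 2, 2]            # subsample sizes m_i
M0 = sum(M)
Y_total = sum(Y)

# Within-units and between-units terms of (11.5).
within_IV = sum(M[i] * (M[i] - m[i]) * S2[i] / (z[i] * m[i]) for i in range(3)) / M0**2
between_IV = sum(z[i] * (Y[i] / z[i] - Y_total) ** 2 for i in range(3)) / M0**2
print(round(within_IV, 3), round(between_IV, 3), round(within_IV + between_IV, 3))
```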


Consequently, it is natural to consider whether the sample mean (as in
method I) would be better than the estimate adopted in method IV.

V. Units Chosen with Probability Proportional to Estimated Size

Estimate = $\bar{y}_V = \bar{y}_i$ = sample mean

The estimate is biased, since, for example,

$$E(\bar{y}_V) = \sum_{i=1}^{N} z_i \bar{Y}_i = \bar{Y}_z$$

If the $z_i$ are good estimates, $\bar{Y}_z$ is close to the correct mean $\bar{\bar{Y}} = \sum M_i\bar{Y}_i/M_0$
and the bias is small.
If we write

$$\bar{y}_V - \bar{\bar{Y}} = (\bar{y}_i - \bar{Y}_i) + (\bar{Y}_i - \bar{Y}_z) + (\bar{Y}_z - \bar{\bar{Y}})$$

the three components of the MSE work out as follows:

$$MSE(\bar{y}_V) = \sum_{i=1}^{N} z_i\frac{(M_i - m_i)}{M_i}\frac{S_{2i}^2}{m_i} + \sum_{i=1}^{N} z_i(\bar{Y}_i - \bar{Y}_z)^2 + (\bar{Y}_z - \bar{\bar{Y}})^2$$
Example. If the values of $z_i$ and $m_i$ are chosen as in Table 11.3, the reader
may verify that the components of the MSE of $\bar{y}_V$ are as shown in Table 11.4.


TABLE 11.4
CONTRIBUTIONS TO THE MSE IN METHOD V

Within Units    Between Units    Bias     Total MSE
   0.173           1.800         0.062      2.035


This is superior to all methods except method III (pps) and is almost as good
as method III.
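
The verification suggested for Table 11.4 can be done in a few lines. This Python sketch (names are illustrative) evaluates the three components of the MSE of the sample mean under method V, using the same $z_i$ and $m_i$ as in Table 11.3.

```python
M = [2, 4, 6]               # unit sizes M_i
Ybar = [0.5, 2.0, 4.0]      # means per element in each unit
S2 = [0.5, 2 / 3, 0.8]      # within-unit variances S_2i^2
z = [0.2, 0.4, 0.4]         # selection probabilities z_i
m = [2, 2, 2]               # subsample sizes m_i
M0 = sum(M)

pop_mean = sum(M[i] * Ybar[i] for i in range(3)) / M0   # correct mean, 2.75
Ybar_z = sum(z[i] * Ybar[i] for i in range(3))          # expected value of the estimate

within_V = sum(z[i] * (M[i] - m[i]) / M[i] * S2[i] / m[i] for i in range(3))
between_V = sum(z[i] * (Ybar[i] - Ybar_z) ** 2 for i in range(3))
bias_sq_V = (Ybar_z - pop_mean) ** 2
total_V = within_V + between_V + bias_sq_V
print(round(within_V, 3), round(between_V, 3), round(bias_sq_V, 4), round(total_V, 3))
```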


The variances of these five estimates could have been found as particular 
cases of theorem 10.1, but with n = 1 they were easily found directly. 

11.4 SUMMARY OF METHODS FOR n = 1


The five methods of estimating the mean per element $\bar{\bar{Y}}$ and their
MSE's in the numerical example are summarized in Table 11.5.


TABLE 11.5
TWO-STAGE SAMPLING METHODS (n = 1)

          Probabilities in                  Estimate                      Bias         MSE
Method    Selecting Units                   of $\bar{\bar{Y}}$            Status       in Example
I         Equal                             $\bar{y}_i$                   Biased       Ia: 2.541
                                                                                       Ib: 2.579
II        Equal                             $N M_i \bar{y}_i / M_0$       Unbiased     6.048
III       $M_i/M_0 \propto$ size            $\bar{y}_i$                   Unbiased     2.002
IV        $z_i \propto$ estimated size      $M_i \bar{y}_i / z_i M_0$     Unbiased     3.796
V         $z_i \propto$ estimated size      $\bar{y}_i$                   Biased       2.035


11.5 SAMPLING METHODS WHEN n > 1


The principal sampling methods for n > 1 are natural extensions of
those discussed in sections 9.8 to 9.12 for one-stage sampling from cluster
units of unequal sizes. Consequently we can use both the variance formulas
developed in these sections and the comparisons made between the
methods.

In the following sections the formulas for the true and estimated MSE's
for the most useful methods when n > 1 are presented. For each method
the conditions under which the estimate becomes self-weighting are noted,
in view of the practical convenience of self-weighting procedures.


11.6 UNITS SELECTED WITH EQUAL PROBABILITIES. 
RATIO-TO-SIZE ESTIMATE 


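For estimating the population mean $\bar{\bar{Y}}$, there are several possible
extensions of method I. The one that seems most generally useful is

$$\hat{\bar{\bar{Y}}}_R = \frac{\sum_{i=1}^{n} M_i \bar{y}_i}{\sum_{i=1}^{n} M_i}$$

This is a typical ratio estimate because both the numerator and the
denominator vary from sample to sample. As is characteristic of ratio
estimates, the estimate is biased, but the bias becomes negligible when n is
large. To find the approximate MSE, write

$$\hat{\bar{\bar{Y}}}_R - \bar{\bar{Y}} = \frac{\sum^n M_i(\bar{y}_i - \bar{\bar{Y}})}{\sum^n M_i} \doteq \frac{\sum^n M_i(\bar{y}_i - \bar{\bar{Y}})}{n\bar{M}}$$

where $\bar{M} = M_0/N$ is the average size per primary unit.
To apply theorem 10.1, p. 272, write

$$\bar{y}' = \frac{\sum^n M_i(\bar{y}_i - \bar{\bar{Y}})}{n\bar{M}}, \qquad \bar{Y}' = \frac{\sum^n M_i(\bar{Y}_i - \bar{\bar{Y}})}{n\bar{M}}$$

It follows that $\bar{Y}'$ is the unweighted mean of the variates $M_i(\bar{Y}_i - \bar{\bar{Y}})/\bar{M}$.
Hence, by theorem 2.2,

$$V(\bar{Y}') = \frac{1-f_1}{n\bar{M}^2}\frac{\sum_{i=1}^{N} M_i^2(\bar{Y}_i - \bar{\bar{Y}})^2}{N-1}$$

Further, in the notation of theorem 10.1,

$$\sigma_2'^2 = E_2(\bar{y}' - \bar{Y}')^2 = \frac{1}{n^2\bar{M}^2}\sum^n E_2[M_i^2(\bar{y}_i - \bar{Y}_i)^2] = \frac{1}{n^2\bar{M}^2}\sum^n\frac{M_i^2(1-f_{2i})S_{2i}^2}{m_i}$$

Hence, by theorem 10.1,

$$MSE(\hat{\bar{\bar{Y}}}_R) = V(\bar{y}') = V(\bar{Y}') + E(\sigma_2'^2)$$

$$= \frac{1-f_1}{n\bar{M}^2}\frac{\sum_{i=1}^{N} M_i^2(\bar{Y}_i - \bar{\bar{Y}})^2}{N-1} + \frac{1}{nN\bar{M}^2}\sum_{i=1}^{N}\frac{M_i^2(1-f_{2i})S_{2i}^2}{m_i} \tag{11.6}$$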
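This estimate reduces to the sample mean, that is, becomes self-weighting
if

$$f_{2i} = \frac{m_i}{M_i} = \text{constant} = \frac{\bar{m}}{\bar{M}} = f_2 \text{ (say)}$$

In this event the within-units contribution may be expressed more simply,
giving

$$MSE(\hat{\bar{\bar{Y}}}_R) = \frac{1-f_1}{n\bar{M}^2}\frac{\sum_{i=1}^{N} M_i^2(\bar{Y}_i - \bar{\bar{Y}})^2}{N-1} + \frac{1-f_2}{n\bar{m}M_0}\sum_{i=1}^{N} M_i S_{2i}^2 \tag{11.7}$$

The resemblance to the corresponding formula when the primary units
are of equal sizes may be noted. From (10.10), section 10.3, we had

$$V(\bar{\bar{y}}) = \frac{1-f_1}{n}\frac{\sum_{i=1}^{N}(\bar{Y}_i - \bar{\bar{Y}})^2}{N-1} + \frac{1-f_2}{nm}\frac{\sum_{i=1}^{N} S_{2i}^2}{N} \tag{11.8}$$

The difference is that in (11.7) the contributions from the primary units
to the MSE are weighted.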

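An approximate sample estimate of the MSE in (11.6) is given by
theorem 10.2. For the between-units component $V(\bar{Y}')$ the usual sample
estimate (subject to a bias of order 1/n), is

$$v(\bar{Y}') = \frac{1-f_1}{n\bar{M}^2}\frac{\sum_{i=1}^{n} M_i^2(\bar{Y}_i - \bar{\bar{Y}}_R)^2}{n-1} \tag{11.9}$$

where $\bar{\bar{Y}}_R = \sum^n M_i\bar{Y}_i / \sum^n M_i$. The "copy" of this is

$$v_c(\bar{Y}') = \frac{1-f_1}{n\bar{M}^2}\frac{\sum_{i=1}^{n} M_i^2(\bar{y}_i - \hat{\bar{\bar{Y}}}_R)^2}{n-1} \tag{11.10}$$

noting that $\hat{\bar{\bar{Y}}}_R = \sum M_i\bar{y}_i / \sum M_i$ is the copy of $\bar{\bar{Y}}_R$.
For $\sigma_2'^2$, an estimate is

$$\hat{\sigma}_2'^2 = \frac{1}{n^2\bar{M}^2}\sum_{i=1}^{n}\frac{M_i^2(1-f_{2i})s_{2i}^2}{m_i}$$

where, as usual,

$$s_{2i}^2 = \frac{1}{m_i - 1}\sum_{j=1}^{m_i}(y_{ij} - \bar{y}_i)^2$$

Hence, by theorem 10.2, a sample estimate of the MSE is

$$v(\hat{\bar{\bar{Y}}}_R) = v(\bar{y}') = v_c(\bar{Y}') + f_1\hat{\sigma}_2'^2$$

$$= \frac{1-f_1}{n\bar{M}^2}\frac{\sum^n M_i^2(\bar{y}_i - \hat{\bar{\bar{Y}}}_R)^2}{n-1} + \frac{f_1}{n^2\bar{M}^2}\sum^n\frac{M_i^2(1-f_{2i})s_{2i}^2}{m_i} \tag{11.11}$$

By a more detailed analysis, Sukhatme (1954) has given a more accurate
estimate of the within-units component.

For a self-weighting sample, (11.11) simplifies to

$$v(\hat{\bar{\bar{Y}}}_R) = \frac{1-f_1}{n\bar{M}^2}\frac{\sum^n M_i^2(\bar{y}_i - \hat{\bar{\bar{Y}}}_R)^2}{n-1} + \frac{f_1(1-f_2)}{n^2\bar{m}\bar{M}}\sum^n M_i s_{2i}^2 \tag{11.12}$$

In both (11.11) and (11.12) note that if $f_1$ is negligible the estimated
variance reduces to its first term.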


Example. From the volume American Men of Science, 20 pages were selected
at random. On each page the ages of two scientists, from two biographies
selected at random, were recorded. The total number of biographies per page
varies in general from about 14 to 21. Estimate the average age and its standard
error from the data in Table 11.6.

From the extreme right column,

$$\hat{\bar{\bar{Y}}}_R = \frac{\sum M_i\bar{y}_i}{\sum M_i} = \frac{17{,}121.5}{359} = 47.7 \text{ years}$$

Since n/N is negligible, we have from (11.11),

$$v(\hat{\bar{\bar{Y}}}_R) = \frac{\sum M_i^2(\bar{y}_i - \hat{\bar{\bar{Y}}}_R)^2}{n\bar{M}^2(n-1)}$$

The numerator is most easily computed as

$$\sum (M_i\bar{y}_i)^2 - 2\hat{\bar{\bar{Y}}}_R\sum (M_i\bar{y}_i)M_i + \hat{\bar{\bar{Y}}}_R^2\sum M_i^2$$

$$= 15{,}375{,}020 - (95.3844)(309{,}747.5) + (2274.55)(6481) = 571{,}300$$


TABLE 11.6
AGES OF 40 SCIENTISTS IN American Men of Science (n = 20, m = 2)

Unit                  Ages               Total
No.      $M_i$    $y_{i1}$   $y_{i2}$    $y_i$     $M_i\bar{y}_i$
 1        15        47         30          77         577.5
 2        19        38         51          89         845.5
 3        19        43         45          88         836.0
 4        16        55         41          96         768.0
 5        16        59         45         104         832.0
 6        19        39         38          77         731.5
 7        18        43         43          86         774.0
 8        18        49         51         100         900.0
 9        18        45         35          80         720.0
10        18        46         59         105         945.0
11        20        71         64         135       1,350.0
12        18        35         46          81         729.0
13        19        61         54         115       1,092.5
14        19        45         87         132       1,254.0
15        18        31         38          69         621.0
16        16        64         39         103         824.0
17        16        63         47         110         880.0
18        19        36         33          69         655.5
19        19        61         39         100         950.0
20        19        54         34          88         836.0
Totals   359                            1,904      17,121.5


Since $\bar{M}$ = 359/20, as estimated from the sample, this gives

$$v(\hat{\bar{\bar{Y}}}_R) = \frac{(20)(571{,}300)}{(19)(359)^2} = 4.67$$

$$s(\hat{\bar{\bar{Y}}}_R) = 2.16 \text{ years}$$
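
The arithmetic of this example is easy to reproduce. The Python sketch below recomputes the ratio-to-size estimate and its standard error from the data of Table 11.6, neglecting n/N as in the text; the variable names are illustrative only.

```python
import math

# Biographies per page (M_i) and the two sampled ages on each page, Table 11.6.
M = [15, 19, 19, 16, 16, 19, 18, 18, 18, 18, 20, 18, 19, 19, 18, 16, 16, 19, 19, 19]
ages = [(47, 30), (38, 51), (43, 45), (55, 41), (59, 45), (39, 38), (43, 43),
        (49, 51), (45, 35), (46, 59), (71, 64), (35, 46), (61, 54), (45, 87),
        (31, 38), (64, 39), (63, 47), (36, 33), (61, 39), (54, 34)]

n = len(M)
ybar = [(a + b) / 2 for a, b in ages]                  # sample mean per page
Y_R = sum(Mi * yi for Mi, yi in zip(M, ybar)) / sum(M)  # ratio-to-size estimate
M_bar = sum(M) / n                                     # average size, from the sample

# First term of (11.11) with the fpc dropped.
numerator = sum(Mi**2 * (yi - Y_R) ** 2 for Mi, yi in zip(M, ybar))
v = numerator / (n * M_bar**2 * (n - 1))
se = math.sqrt(v)
print(round(Y_R, 1), round(v, 2), round(se, 2))
```

The numerator computed without intermediate rounding differs slightly from the rounded figure 571,300 quoted in the text, but the standard error agrees to two decimals.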


When primary units are selected with equal probabilities, an alternative
estimate of the population mean is

$$\frac{1}{n}(\bar{y}_1 + \bar{y}_2 + \cdots + \bar{y}_n)$$

This estimate is self-weighting if $m_i$ = constant, as in the preceding
example. When $M_i$ and $\bar{Y}_i$ are uncorrelated, this estimate may be
satisfactory, but it is liable to a bias that does not vanish even with n large
when $M_i$ and $\bar{Y}_i$ are correlated.


11.7 UNITS SELECTED WITH EQUAL PROBABILITIES.
UNBIASED ESTIMATE


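The unbiased estimate (method II in section 11.2) is

$$\hat{\bar{\bar{Y}}}_u = \frac{N}{nM_0}\sum_{i=1}^{n} M_i\bar{y}_i = \frac{1}{n\bar{M}}\sum_{i=1}^{n} M_i\bar{y}_i$$

To find the variance, separate the error as usual into the within-units and
between-units components, by writing

$$\hat{\bar{\bar{Y}}}_u - \bar{\bar{Y}} = \frac{1}{n\bar{M}}\sum^n M_i(\bar{y}_i - \bar{Y}_i) + \frac{1}{n\bar{M}}\sum^n (Y_i - \bar{Y})$$

where we have used the facts that $Y_i = M_i\bar{Y}_i$ and that $\sum^n \bar{Y}/n\bar{M} =
\bar{Y}/\bar{M} = \bar{\bar{Y}}$. By squaring and taking the average, we find (putting the
between-units component first)

$$V(\hat{\bar{\bar{Y}}}_u) = \frac{1-f_1}{n\bar{M}^2}\frac{\sum_{i=1}^{N}(Y_i - \bar{Y})^2}{N-1} + \frac{1}{nN\bar{M}^2}\sum_{i=1}^{N}\frac{M_i^2(1-f_{2i})S_{2i}^2}{m_i} \tag{11.13}$$

Like the ratio-to-size estimate, the unbiased estimate becomes
self-weighting if $f_{2i} = m_i/M_i$ = constant = $f_2$. We then have

$$\hat{\bar{\bar{Y}}}_u = \frac{N}{n f_2 M_0}\sum_{i=1}^{n}\sum_{j=1}^{m_i} y_{ij} \tag{11.14}$$

With a self-weighting estimate, the variance in (11.13) can be expressed as

$$V(\hat{\bar{\bar{Y}}}_u) = \frac{1-f_1}{n\bar{M}^2}\frac{\sum_{i=1}^{N}(Y_i - \bar{Y})^2}{N-1} + \frac{1-f_2}{n\bar{m}M_0}\sum_{i=1}^{N} M_i S_{2i}^2 \tag{11.15}$$

For an unbiased estimate of (11.13) from the sample, the usual
procedure leads to the formula

$$v(\hat{\bar{\bar{Y}}}_u) = \frac{1-f_1}{n\bar{M}^2}\frac{\sum^n (M_i\bar{y}_i - \hat{\bar{Y}}_u)^2}{n-1} + \frac{f_1}{n^2\bar{M}^2}\sum^n\frac{M_i^2(1-f_{2i})s_{2i}^2}{m_i} \tag{11.16}$$

where $\hat{\bar{Y}}_u = \sum^n M_i\bar{y}_i / n$.
For the self-weighting estimate (11.14), this reduces to

$$v(\hat{\bar{\bar{Y}}}_u) = \frac{1-f_1}{n\bar{M}^2}\frac{\sum^n (M_i\bar{y}_i - \hat{\bar{Y}}_u)^2}{n-1} + \frac{f_1(1-f_2)}{n^2\bar{m}\bar{M}}\sum^n M_i s_{2i}^2 \tag{11.17}$$

Example. For the data in Table 11.6 the unbiased estimate requires a
knowledge of N (the number of pages) and $M_0$ (the number of biographies in the
book). N is 2823 and $M_0$ is given as about 50,000. Accepting this figure for
illustration, we have

$$\hat{\bar{\bar{Y}}}_u = \frac{2823}{(20)(50{,}000)}(17{,}121.5) = 48.3 \text{ years}$$

From (11.16), with $\bar{M}$ = 50,000/2823 = 17.712,

$$v(\hat{\bar{\bar{Y}}}_u) = \frac{1}{(20)(19)(17.712)^2}\left[(577.5)^2 + \cdots + (836.0)^2 - \frac{(17{,}121.5)^2}{20}\right] = 6.02$$

The s.e. of the estimate is 2.45 years.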
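
The unbiased estimate and its standard error can be recomputed from the last column of Table 11.6. In this Python sketch the names are illustrative, N = 2823 and $M_0$ = 50,000 are the figures accepted in the text, and only the between-units term of (11.16) is used, with the fpc neglected as in the worked example.

```python
import math

# Column M_i * ybar_i of Table 11.6.
My = [577.5, 845.5, 836.0, 768.0, 832.0, 731.5, 774.0, 900.0, 720.0, 945.0,
      1350.0, 729.0, 1092.5, 1254.0, 621.0, 824.0, 880.0, 655.5, 950.0, 836.0]
n = len(My)
N = 2823          # number of pages
M0 = 50_000       # approximate number of biographies in the book
M_bar = M0 / N    # average number of biographies per page

Y_u = N / (n * M0) * sum(My)                       # unbiased estimate of the mean age
mean_My = sum(My) / n
between = sum((t - mean_My) ** 2 for t in My) / (n - 1)
v = between / (n * M_bar**2)                       # first term of (11.16), fpc dropped
se = math.sqrt(v)
print(round(Y_u, 1), round(v, 2), round(se, 2))
```

The larger standard error than for the ratio-to-size estimate reflects the extra variation among the unit totals.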


11.8 UNITS SELECTED WITH PROBABILITY PRO- 
PORTIONAL TO A MEASURE OF SIZE. 
UNBIASED ESTIMATE 


Primary units are selected with probabilities proportional to $z_i$. Selection
with replacement is assumed for simplicity. Results for $z_i = M_i/M_0$
(probability proportional to size) follow as a special case.

The subsample of m; subunits from the ith unit is assumed to be drawn 
without replacement. If the ith unit is drawn twice, we suppose that the 
whole subsample is replaced, and a new independent drawing of m; 
subunits, again without replacement, is made. 

An unbiased estimate of the population mean (extension of method IV) 
is 
(11.18) 


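To find the variance, write

$$\hat{\bar{\bar Y}}_{ppes} - \bar{\bar Y} = \frac{1}{nM_0}\sum^n \frac{M_i(\bar y_i - \bar Y_i)}{z_i} + \frac{1}{nM_0}\sum^n\left(\frac{Y_i}{z_i} - Y\right)$$

The between-units component may be written

$$\frac{1}{nM_0}\sum^N t_i\left(\frac{Y_i}{z_i} - Y\right)$$

where $t_i$ is the number of times that the $i$th unit is drawn.
Its variance, $V_1$, is obtainable from theorem 9.3 as

$$V_1 = \frac{1}{nM_0^2}\sum^N z_i\left(\frac{Y_i}{z_i} - Y\right)^2 \tag{11.20}$$

The within-units contribution to the variance for a unit that is drawn once
is, by theorem 2.2,

$$\frac{1}{n^2M_0^2}\,\frac{M_i^2(1-f_{2i})S_{2i}^2}{z_i^2m_i}$$

If the unit is drawn $t_i$ times, each drawing contributes the same amount,
since successive drawings are independent. This gives

$$V_2 \mid t_i = \frac{1}{n^2M_0^2}\sum^N \frac{t_iM_i^2(1-f_{2i})S_{2i}^2}{z_i^2m_i}$$

Hence, since $E(t_i) = nz_i$,

$$V_2 = E(V_2 \mid t_i) = \frac{1}{nM_0^2}\sum^N \frac{M_i^2(1-f_{2i})S_{2i}^2}{z_im_i} \tag{11.21}$$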


There are other ways in which the subsamples may be drawn. If the
$i$th unit is selected $t_i$ times, one variant is to draw a subsample of size
$m_it_i$ without replacement, provided, of course, that $M_i \ge m_it_i$. This
method is more precise but will be slightly costlier, since more subunits
have to be measured. Sukhatme (1954) has shown that the within-units
contribution to the variance for this method is

$$V_2 - \frac{n-1}{nM_0^2}\sum^N M_iS_{2i}^2 \tag{11.22}$$

where $V_2$ is as given in (11.21).

Another possibility is to draw a single sample of size $m_i$ no matter
how many times the $i$th unit is selected. This sample receives a weight
$t_i$ in making the estimate. The within-units contribution to the variance
is found to be

$$V_2 + \frac{n-1}{nM_0^2}\sum^N \frac{M_i^2(1-f_{2i})S_{2i}^2}{m_i}$$

The differences in precision among these three methods may be shown to
be small if the over-all sampling fraction is small.
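The three within-units expressions can be compared numerically. The following is a minimal sketch, not part of the original text: the function name and the small population used in the test values are hypothetical, and the formulas coded are (11.21), (11.22), and the expression for the single weighted subsample.

```python
def within_unit_variances(M, m, z, S2, n):
    """Within-units variance contributions for the three subsampling rules.

    M, m, z, S2 are per-unit lists (sizes, subsample sizes, selection
    probabilities, within-unit variances S_{2i}^2); n is the number of draws.
    All inputs are hypothetical illustration values.
    """
    M0 = sum(M)
    # (11.21): a fresh subsample of size m_i at every draw of unit i
    v_indep = sum(Mi**2 * (1 - mi / Mi) * s / (zi * mi)
                  for Mi, mi, zi, s in zip(M, m, z, S2)) / (n * M0**2)
    # (11.22): one subsample of size m_i * t_i drawn without replacement
    v_pooled = v_indep - (n - 1) / (n * M0**2) * sum(
        Mi * s for Mi, s in zip(M, S2))
    # single subsample of size m_i, weighted by t_i
    v_single = v_indep + (n - 1) / (n * M0**2) * sum(
        Mi**2 * (1 - mi / Mi) * s / mi for Mi, mi, s in zip(M, m, S2))
    return v_indep, v_pooled, v_single


M = [100, 200, 300, 400]
z = [Mi / sum(M) for Mi in M]          # pps as a special case
v_indep, v_pooled, v_single = within_unit_variances(M, [5] * 4, z, [4.0] * 4, n=2)
print(v_pooled <= v_indep <= v_single)  # True
```

The ordering follows directly from the signs of the correction terms; how close the three values are depends on the sizes of $nz_i$ and $m_i/M_i$ in the population at hand.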


To continue with our first method of subsampling, we have from (11.20)
and (11.21),

$$V(\hat{\bar{\bar Y}}_{ppes}) = \frac{1}{nM_0^2}\sum^N z_i\left(\frac{Y_i}{z_i} - Y\right)^2 + \frac{1}{nM_0^2}\sum^N \frac{M_i^2(1-f_{2i})S_{2i}^2}{z_im_i} \tag{11.23}$$

To discover when the estimate becomes self-weighting, write

$$\hat{\bar{\bar Y}}_{ppes} = \frac{1}{nM_0}\sum^n \frac{M_i}{z_im_i}\sum_{j=1}^{m_i} y_{ij}$$

The necessary condition is therefore

$$\frac{M_i}{z_im_i} = \text{constant} = \frac{n}{f_0} \quad \text{(say)} \tag{11.24}$$

so that the estimate becomes $\sum\sum y_{ij}/f_0M_0$. The quantity $f_0$ may be defined
as the expected over-all sampling fraction. For the expected number of
subunits in the sample is

$$E\left(\sum^n m_i\right) = E\left(\sum^N t_im_i\right) = n\sum^N z_im_i = f_0M_0$$

using (11.24).


From (11.24), $m_i/M_i = f_0/nz_i$. If $f_0$ is chosen in advance, the field
worker can be told what subsampling fraction $m_i/M_i$ to take before he
goes to the primary unit. For example, suppose that an over-all sampling
fraction of 2% is aimed at, so that $f_0 = 0.02$, and that $n = 60$ primary
units have been selected. If $z_i = 0.0026$ for one unit, we must have
$m_i/M_i = 0.02/(60)(0.0026)$, or 1 in 7.8.
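The rule $m_i/M_i = f_0/nz_i$ is easy to tabulate for the field worker. A minimal sketch follows; the function name is hypothetical, and the guard against fractions above 1 is an added assumption about how an infeasible case might be flagged.

```python
def subsampling_fraction(f0, n, z_i):
    """Subsampling fraction m_i/M_i = f0 / (n * z_i), from (11.24)."""
    frac = f0 / (n * z_i)
    if frac > 1:
        # A fraction above 1 cannot be realized for this unit.
        raise ValueError("required subsampling fraction exceeds 1")
    return frac


frac = subsampling_fraction(f0=0.02, n=60, z_i=0.0026)
print(round(frac, 4), round(1 / frac, 1))  # 0.1282 7.8
```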

An unbiased estimate of $V(\hat{\bar{\bar Y}}_{ppes})$ is the simple expression

$$v(\hat{\bar{\bar Y}}_{ppes}) = \frac{1}{n(n-1)M_0^2}\sum^n (y_i' - \bar y')^2 \tag{11.25}$$

where $y_i' = M_i\bar y_i/z_i$ and $\bar y'$ is the unweighted mean of the $y_i'$. For a
self-weighting sample, $y_i' = ny_i/f_0$, where $y_i$ is the sample total in the $i$th unit.

Proof. Theorem 9.5, section 9.10, showed that if the $Y_i = M_i\bar Y_i$ are
known, an unbiased estimate of the between-units contribution (on
dividing by $M_0^2$) is

$$\frac{1}{n(n-1)M_0^2}\sum^n\left(\frac{Y_i}{z_i} - \frac1n\sum^n\frac{Y_i}{z_i}\right)^2$$

Hence we write, in the usual way,

$$y_i' - \bar y' = \left(\frac{Y_i}{z_i} - \frac1n\sum^n\frac{Y_i}{z_i}\right) + \left[\frac{M_i(\bar y_i - \bar Y_i)}{z_i} - \frac1n\sum^n\frac{M_i(\bar y_i - \bar Y_i)}{z_i}\right]$$

The within-units component of $\sum (y_i' - \bar y')^2$, for a fixed set of primary
units, is

$$\frac{n-1}{n}\sum^n \frac{M_i^2(1-f_{2i})S_{2i}^2}{z_i^2m_i}$$

Its over-all average is

$$E\left[\frac{n-1}{n}\sum^n \frac{M_i^2(1-f_{2i})S_{2i}^2}{z_i^2m_i}\right] = (n-1)\sum^N \frac{M_i^2(1-f_{2i})S_{2i}^2}{z_im_i}$$

Hence, on dividing by $n(n-1)M_0^2$, we obtain the correct within-units
contribution to $V(\hat{\bar{\bar Y}}_{ppes})$ in (11.23). This establishes the result (11.25).
With a self-weighting sample, (11.25) takes the simpler form

$$v(\hat{\bar{\bar Y}}_{ppes}) = \frac{n}{(n-1)f_0^2M_0^2}\sum^n (y_i - \bar y)^2 \tag{11.25'}$$

where $y_i$ is the sample total in the $i$th unit.

11.9 UNITS SELECTED WITH PROBABILITY
PROPORTIONAL TO SIZE. UNBIASED ESTIMATE


If $z_i = M_i/M_0$, the unbiased estimate (11.18) reduces to

$$\hat{\bar{\bar Y}}_{pps} = \frac1n\sum^n \bar y_i \tag{11.26}$$

Clearly, this estimate becomes the unweighted sample mean per subunit
$\bar y$ if $m_i = m$.

From (11.23), the variance is

$$V(\hat{\bar{\bar Y}}_{pps}) = \frac{1}{nM_0}\sum^N M_i(\bar Y_i - \bar{\bar Y})^2 + \frac{1}{nM_0}\sum^N \frac{M_i(1-f_{2i})S_{2i}^2}{m_i} \tag{11.27}$$

and from (11.25), since $y_i'$ becomes $M_0\bar y_i$, an unbiased estimate of variance
is

$$v(\hat{\bar{\bar Y}}_{pps}) = \frac{1}{n(n-1)}\sum^n (\bar y_i - \hat{\bar{\bar Y}}_{pps})^2 \tag{11.28}$$

If $m_i = m$, this may be written

$$v(\hat{\bar{\bar Y}}_{pps}) = \frac{1}{n(n-1)m^2}\sum^n (y_i - \bar y)^2 \tag{11.29}$$

where $y_i = m\bar y_i$ = sample total in the $i$th unit.
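The estimate (11.18) and its variance estimate (11.25) depend on the sample only through the quantities $y_i' = M_i\bar y_i/z_i$, which makes them simple to compute. The sketch below is illustrative only: the function name and the three-unit sample are hypothetical. With $z_i = M_i/M_0$ it reproduces the pps forms (11.26) and (11.28).

```python
def ppes_estimate(M, z, ybar, M0):
    """Estimate (11.18) of the mean per subunit and variance estimate (11.25).

    M, z, ybar hold, for each of the n draws (with replacement), the unit
    size, its selection probability, and its subsample mean.
    """
    n = len(M)
    y_prime = [Mi * yb / zi for Mi, yb, zi in zip(M, ybar, z)]
    mean_prime = sum(y_prime) / n
    estimate = mean_prime / M0
    variance = sum((y - mean_prime) ** 2 for y in y_prime) / (n * (n - 1) * M0**2)
    return estimate, variance


# Hypothetical sample of three units drawn with pps: z_i = M_i / M0.
M0 = 1000
M = [120, 80, 200]
est, var = ppes_estimate(M, [Mi / M0 for Mi in M], [10.0, 12.0, 14.0], M0)
print(round(est, 6), round(var, 6))  # 12.0 1.333333
```

In the pps case the estimate is the plain mean of the unit means (12.0 here), and the variance estimate is their sample variance divided by n.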


11.10 UNITS SELECTED WITH PROBABILITY 
PROPORTIONAL TO A MEASURE OF SIZE. 
ESTIMATE: RATIO TO SIZE. 


$$\hat{\bar{\bar Y}}_{Rppes} = \frac{\sum^n M_i\bar y_i/z_i}{\sum^n M_i/z_i} \tag{11.30}$$

The numerator is an unbiased estimate of $nY$, and the denominator an
unbiased estimate of $nM_0$. The estimate becomes the unweighted sample
mean $\bar y$ if $m_i/M_i = f_0/nz_i$, the same condition that held for the unbiased
estimate.

Since this estimate is a particular case of the more general ratio estimate
discussed in section 11.14, the variance formulas will be proved there.
Assuming n large,

$$V(\hat{\bar{\bar Y}}_{Rppes}) = \frac{1}{nM_0^2}\sum^N \frac{M_i^2(\bar Y_i - \bar{\bar Y})^2}{z_i} + \frac{1}{nM_0^2}\sum^N \frac{M_i^2(1-f_{2i})S_{2i}^2}{z_im_i} \tag{11.31}$$

The estimate of variance (slightly biased) is

$$v(\hat{\bar{\bar Y}}_{Rppes}) = \frac{1}{n(n-1)M_0^2}\sum^n \frac{M_i^2}{z_i^2}(\bar y_i - \hat{\bar{\bar Y}}_{Rppes})^2 \tag{11.32}$$

When the sampling is self-weighting, this can be written

$$v(\hat{\bar{\bar Y}}_{Rppes}) = \frac{n}{(n-1)f_0^2M_0^2}\sum^n m_i^2(\bar y_i - \bar y)^2 \tag{11.33}$$


11.11 COMPARISON OF THE METHODS 


In section 9.12 the precisions of the following three sampling plans 
were compared for one-stage sampling with units of unequal sizes: 


Selection of units: equal probabilities. Unbiased estimate.
Selection of units: equal probabilities. Ratio-to-size estimate. 
Selection of units: probability proportional to size. Unbiased estimate. 


In two-stage sampling the conclusions drawn in section 9.12 remain
valid for the between-units contribution to the variance, because this
contribution is the same as the variance for the corresponding one-stage
plan. To summarize from section 9.12, it was found that if $\bar Y_i$ is uncorrelated
with $M_i$, or changes only slightly as $M_i$ changes, the pps estimate
and the ratio-to-size estimate are superior to the unbiased estimate. The
superiority may be great if the $M_i$ vary substantially. On the other hand,
the unbiased estimate wins if unit totals $Y_i$ are uncorrelated with $M_i$.

The relative performances of the ratio-to-size estimate and the pps
estimate depend on the relation between the variance of $\bar Y_i$ and $M_i$. If
$V(\bar Y_i)$ is proportional to $M_i^{-g}$, the pps estimate is more precise if $g < 1$
and less precise if $g > 1$. The condition $g < 1$ probably approximates
the situation in the majority of applications.

For each sampling plan in section 11.5 to section 11.10, the self-weighting
form of the estimate was given. Unless the within-unit variances
$S_{2i}^2$ differ greatly from one another, the use of a self-weighting plan should
not incur any material loss of precision. As we have seen, the choice of
the $m_i$ affects only the within-units contribution to the variance. As
shown by (11.6) and (11.13), the within-units contributions are approximately
the same for the ratio-to-size estimate and the unbiased estimate,
that is,

$$V_2 = \frac{1}{nN\bar M^2}\sum^N \frac{M_i^2(1-f_{2i})S_{2i}^2}{m_i} = \frac{1}{nN\bar M^2}\left(\sum^N \frac{M_i^2S_{2i}^2}{m_i} - \sum^N M_iS_{2i}^2\right)$$

If the $m_i$ are chosen to minimize $V_2$ for a fixed total sample size $\sum m_i$, we
find $m_i \propto M_iS_{2i}$. The self-weighting estimate requires $m_i \propto M_i$. For the
pps estimate, the reader may verify that the minimum $V_2$ is given by
choosing $m_i \propto S_{2i}$, whereas the self-weighting plan has $m_i$ = constant.
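The loss from using the self-weighting allocation instead of the optimum one can be gauged numerically. The sketch below compares the part of $V_2$ that depends on the $m_i$, namely $\sum M_i^2S_{2i}^2/m_i$, under the two allocations for equal-probability selection. The helper names and the three-unit population are hypothetical, and the subsample sizes are left unrounded.

```python
import math


def v2_numerator(M, m, S2):
    """Sum of M_i^2 S_{2i}^2 / m_i, the part of V_2 that depends on the m_i."""
    return sum(Mi**2 * s / mi for Mi, mi, s in zip(M, m, S2))


def allocate(weights, total):
    """Split `total` subunits in proportion to `weights` (no rounding)."""
    scale = total / sum(weights)
    return [w * scale for w in weights]


M = [50, 100, 200]       # hypothetical unit sizes
S2 = [1.0, 4.0, 9.0]     # hypothetical within-unit variances S_{2i}^2
total = 35               # fixed total sample size

m_prop = allocate(M, total)                                            # m_i proportional to M_i
m_opt = allocate([Mi * math.sqrt(s) for Mi, s in zip(M, S2)], total)   # m_i proportional to M_i S_2i

v_prop = v2_numerator(M, m_prop, S2)
v_opt = v2_numerator(M, m_opt, S2)
print(round(v_prop, 1), round(v_opt, 1))  # 22500.0 20642.9
```

Here the within-unit standard deviations differ by a factor of three, yet the self-weighting allocation raises this term by only about 9 per cent, in line with the remark that the loss is seldom material.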
In comparing the within-units variances for the different plans, we
assume that the self-weighting forms are used. From (11.7) and (11.27)
the $V_2$ terms are as follows:

equal:

$$V_2 = \frac{1}{n\bar mM_0}\sum^N M_i\left(1 - \frac{\bar m}{\bar M}\right)S_{2i}^2$$

pps:

$$V_2 = \frac{1}{nmM_0}\sum^N M_i\left(1 - \frac{m}{M_i}\right)S_{2i}^2$$

The two expressions differ only in one minor respect. With equal probabilities,
the fpc term is the same in all units, whereas with pps sampling
$m/M_i$ is smaller in the larger units and therefore $(1 - m/M_i)$ is larger in
the larger units. Since $S_{2i}^2$ is often greater in large than in small units,
pps sampling gives perhaps a higher within-units contribution. With the
subsampling fractions that are common in practice, however, the difference
should be trivial. In the example in Table 11.2, section 11.2, the $V_2$
contributions were 0.189 for pps sampling and 0.183 for the self-weighting
form of the ratio estimate (method I).


In a comparison of the three [...] contribution is therefore to dilut[...]
For instance, if the between-unit contributions
for two plans are $V_1 = 2$ and $V_1 = 1$ and the within-unit contributions [...]

[...] be known for the n primary units that are in the sample, and pps sampling
demands a knowledge of all the $M_i$ in the population. For estimating
the population mean, the unbiased estimates, with either equal probabilities
or ppes, require a knowledge of $M_0$, the total number of subunits
in the population, whereas the corresponding ratio-to-size estimates do
not. For estimating the population total, the situation is reversed.


11.12 RATIOS TO ANOTHER VARIABLE 


In two-stage sampling the quantity to be estimated is often a ratio $Y/X$.
This happens for two different reasons. As mentioned previously, if x is
the value of y at a recent census, the ratio $y/x$ may be relatively stable.
An estimate of the population total or mean of y that is based on this
ratio may be more precise than the estimates considered in this chapter.
This was found to be the case for sampling farm items in North Carolina
[L. H. Madow (1950); Jebe (1952)].

Ratio estimates of this type are encountered also in the estimation of
proportions or means over parts of the population. In an urban survey
with the city block as primary unit, an example of a proportion of this
type is

    (number of employed males over 16 years) / (total number of males over 16 years)

If $y_{ij} = 1$ for any employed male over 16 and $y_{ij} = 0$ otherwise, and
$x_{ij} = 1$ for any male over 16 and $x_{ij} = 0$ otherwise, the population
proportion is $Y/X$. Other examples for this type of survey are the average
income of families that subscribe to a certain magazine or the average
amount of pocket money per teen-age child.


11.13 VARIANCE OF THE RATIO WITH EQUAL 
PROBABILITIES OF SELECTION 


Formulas for the MSE and estimated variance are easily found from
results already established. Consider first the selection of units with equal
probabilities. The ratio estimate is

$$\hat R = \frac{\sum^n M_i\bar y_i}{\sum^n M_i\bar x_i}$$

Now, with $R = Y/X$,

$$\hat R - R = \frac{\sum^n M_i(\bar y_i - R\bar x_i)}{\sum^n M_i\bar x_i} \doteq \frac{\sum^n M_i(\bar y_i - R\bar x_i)}{n\bar X}$$

where we approximate as usual by replacing $\sum^n M_i\bar x_i$ in the denominator
by its expected value $nX/N$ or $n\bar X$.

Let $d_{ij} = y_{ij} - Rx_{ij}$. By the definition of R, the population total $D$ and
the population mean per subunit $\bar{\bar D}$ both vanish. With the ratio-to-size
estimate (section 11.6), the approximate error of the estimate was expressed
as $\sum^n M_i(\bar y_i - \bar{\bar Y})/n\bar M$. With the present ratio estimate, the approximate
error may be written as $\sum^n M_i(\bar d_i - \bar{\bar D})/n\bar X$. Hence variance formulas for
$\hat R$ are obtainable from those in section 11.6 by replacing $y_{ij}$ by $d_{ij}$ and
multiplying by $(\bar M/\bar X)^2$.

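For the true MSE, this gives, from (11.6),

$$\text{MSE}(\hat R) = \frac{1-f_1}{n\bar X^2}\,\frac{\sum^N M_i^2(\bar Y_i - R\bar X_i)^2}{N-1} + \frac{1}{nN\bar X^2}\sum^N \frac{M_i^2(1-f_{2i})S_{d2i}^2}{m_i} \tag{11.34}$$

where

$$S_{d2i}^2 = \frac{1}{M_i-1}\sum_{j=1}^{M_i}\left[(y_{ij} - Rx_{ij}) - (\bar Y_i - R\bar X_i)\right]^2$$

If $f_{2i} = m_i/M_i = f_2$ = constant, $\hat R$ reduces to the ratio of the sample
totals $\sum\sum y_{ij}/\sum\sum x_{ij}$. The MSE then takes the form

$$\text{MSE}(\hat R) = \frac{1-f_1}{n\bar X^2}\,\frac{\sum^N M_i^2(\bar Y_i - R\bar X_i)^2}{N-1} + \frac{1-f_2}{n\bar m\bar X^2}\sum^N \frac{M_i}{M_0}S_{d2i}^2 \tag{11.35}$$

For the estimated variance, substitute $d_{ij}$ for $y_{ij}$ and $\bar X$ for $\bar M$ in (11.11)
for $v(\hat{\bar{\bar Y}}_R)$. The resulting expression contains R. Substitute $\hat R$ for R,
noting that the term $\hat{\bar{\bar Y}}_R$ in (11.11) becomes zero. This gives

$$v(\hat R) = \frac{1-f_1}{n\bar X^2}\,\frac{\sum^n M_i^2(\bar y_i - \hat R\bar x_i)^2}{n-1} + \frac{f_1}{n^2\bar X^2}\sum^n \frac{M_i^2(1-f_{2i})s_{d2i}^2}{m_i} \tag{11.36}$$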


11.14 VARIANCE OF THE RATIO WITH ppes SELECTION 


If primary units are selected with probabilities proportional to $z_i$ with
replacement, the estimate is

$$\hat R = \frac{\sum^n M_i\bar y_i/z_i}{\sum^n M_i\bar x_i/z_i} \tag{11.37}$$

The numerator and denominator are unbiased estimates of $nY$ and $nX$,
respectively. To find the variance, write

$$\hat R - R = \frac{\sum^n M_i(\bar y_i - R\bar x_i)/z_i}{\sum^n M_i\bar x_i/z_i} \doteq \frac{1}{nX}\sum^n \frac{M_i(\bar y_i - R\bar x_i)}{z_i}$$

On comparing the approximate error of $\hat R$ with that of the unbiased
estimate $\hat{\bar{\bar Y}}_{ppes}$ (section 11.8), it follows that $V(\hat R)$ can be derived from
$V(\hat{\bar{\bar Y}}_{ppes})$ by replacing $y_{ij}$ by $d_{ij} = y_{ij} - Rx_{ij}$ and $M_0$ by X. From (11.23),
this gives

$$V(\hat R) = \frac{1}{nX^2}\sum^N \frac{1}{z_i}(Y_i - RX_i)^2 + \frac{1}{nX^2}\sum^N \frac{M_i^2(1-f_{2i})S_{d2i}^2}{z_im_i} \tag{11.38}$$

The estimate $\hat R$ reduces to the ratio of the sample totals if

$$\frac{M_i}{z_im_i} = \text{constant} = \frac{n}{f_0}$$

this condition being the same as that for $\hat{\bar{\bar Y}}_{ppes}$.
From (11.25), a sample estimate of $V(\hat R)$ that is slightly biased is

$$v(\hat R) = \frac{1}{n(n-1)X^2}\sum^n (y_i' - \hat Rx_i')^2 \tag{11.39}$$

where $y_i' = M_i\bar y_i/z_i$ and $x_i' = M_i\bar x_i/z_i$.
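As an illustration of (11.37) and (11.39), the following sketch computes the ratio estimate and its variance estimate from per-unit subsample means. The function name and data are hypothetical. The population total X is treated as known; if it is not, substituting the sample estimate $\sum x_i'/n$ is one possibility, and that choice is an assumption of this sketch, in the spirit of the substitution mentioned for stratified sampling in section 11.18.

```python
def ppes_ratio(M, z, ybar, xbar, X=None):
    """Ratio estimate (11.37) and variance estimate (11.39), ppes with replacement."""
    n = len(M)
    y_prime = [Mi * yb / zi for Mi, yb, zi in zip(M, ybar, z)]
    x_prime = [Mi * xb / zi for Mi, xb, zi in zip(M, xbar, z)]
    r_hat = sum(y_prime) / sum(x_prime)
    if X is None:
        X = sum(x_prime) / n      # sample estimate of the total X (assumption)
    resid = [y - r_hat * x for y, x in zip(y_prime, x_prime)]
    variance = sum(r * r for r in resid) / (n * (n - 1) * X**2)
    return r_hat, variance


# Hypothetical three-unit sample in which y is exactly half of x in every unit.
M = [100, 200, 150]
z = [0.10, 0.20, 0.15]
r_exact, v_exact = ppes_ratio(M, z, [2.0, 3.0, 4.0], [4.0, 6.0, 8.0])
# Perturbing one unit mean gives a nonzero variance estimate.
r_pert, v_pert = ppes_ratio(M, z, [2.0, 3.0, 4.0], [4.0, 5.0, 8.0])
print(round(r_exact, 6), round(v_exact, 6), v_pert > 0)  # 0.5 0.0 True
```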


11.15 CHOICE OF SAMPLING AND SUBSAMPLING
FRACTIONS. EQUAL PROBABILITIES


This problem is discussed first for the ratio-to-size estimate when units
are chosen with equal probabilities. The subsampling fraction $m_i/M_i$ is
assumed constant, so that the estimate is the sample mean per element.

The simplest cost function contains three terms:

$c_u$ = fixed cost per primary unit

$c_2$ = cost per subunit

$c_l$ = cost of listing per subunit in a selected unit

The third term is included because the sampler must usually list the
elements in any selected unit and verify their number in order to draw a
subsample. Hence

$$\text{cost} = c_un + c_2\sum^n m_i + c_l\sum^n M_i \tag{11.40}$$

This formula is not usable as it stands, since the cost depends on the
particular set of units that is chosen. Instead, consider the average cost
over n units, which equals

$$E(C) = c_un + c_2n\bar m + c_ln\bar M = (c_u + c_l\bar M)n + c_2n\bar m = c_1n + c_2n\bar m \tag{11.41}$$

where $c_1$ now includes the average cost of listing a unit.
From (11.7) in section 11.6,

$$\text{MSE}(\bar y) = \frac{1-f_1}{n}\,\frac{\sum^N M_i^2(\bar Y_i - \bar{\bar Y})^2}{\bar M^2(N-1)} + \frac{1-f_2}{n\bar m}\sum^N \frac{M_i}{M_0}S_{2i}^2$$

Write

$$S_b^2 = \frac{\sum^N M_i^2(\bar Y_i - \bar{\bar Y})^2}{\bar M^2(N-1)}$$

This is a weighted variance among unit means per element. It is analogous
to the variance $S_1^2$ in section 10.6 and reduces to $S_1^2$ if all $M_i$ are equal.
We may also write

$$S_2^2 = \sum^N \frac{M_i}{M_0}S_{2i}^2$$

This is a weighted mean of the within-unit variances. It reduces to the
$S_2^2$ of section 10.6 if all $M_i$ are equal.
In this notation

$$\text{MSE}(\bar y) = \frac1n\left(S_b^2 - \frac{S_2^2}{\bar M}\right) + \frac{1}{n\bar m}S_2^2 - \frac1NS_b^2 \tag{11.42}$$

The cost and MSE equations (11.41) and (11.42) are of exactly the same
form as those in section 10.6, except that $\bar m$ replaces m, $S_b^2$ replaces $S_1^2$,
and $c_1$ includes the cost of listing. Hence, from (10.18),

$$\bar m_{opt} = \frac{S_2}{\sqrt{S_b^2 - S_2^2/\bar M}}\sqrt{\frac{c_1}{c_2}} \tag{11.43}$$

The methods given in section 10.6 for utilizing knowledge about the
ratios $S_2/S_b$ and $c_1/c_2$ to guide the selection of $\bar m_{opt}$ are applicable here.
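Formula (11.43) can be evaluated directly once rough values of the variance components and unit costs are available. A minimal sketch follows; the function name and all numerical inputs are hypothetical, and the check that $S_b^2 - S_2^2/\bar M$ is positive is an added safeguard rather than part of the text.

```python
import math


def optimum_subsample_size(S_b2, S_22, M_bar, c_u, c_2, c_l):
    """Optimum average subsample size from (11.43), with c_1 = c_u + c_l * M_bar."""
    c_1 = c_u + c_l * M_bar
    denom = S_b2 - S_22 / M_bar
    if denom <= 0:
        raise ValueError("S_b^2 - S_2^2 / M_bar must be positive for (11.43)")
    return math.sqrt(S_22 / denom) * math.sqrt(c_1 / c_2)


# Hypothetical inputs: S_b^2 = 2, S_2^2 = 16, 16 subunits per unit on average,
# fixed cost 8 per unit, 1 per subunit, listing cost 0.0625 per subunit listed.
m_opt_bar = optimum_subsample_size(2.0, 16.0, 16, c_u=8.0, c_2=1.0, c_l=0.0625)
print(m_opt_bar)  # 12.0
```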


The unbiased estimate when units are drawn with equal probabilities can
be handled similarly.

The next section presents a more general analysis of this problem.

11.16 SAMPLING AND SUBSAMPLING FRACTIONS 
FOR ppes SAMPLING 


An important analysis by Hansen and Hurwitz (1949) shows how to
determine simultaneously the optimum probabilities of selection of units
and the optimum sampling and subsampling fractions. The analysis is
presented for ratio estimates $\hat R$. Units are selected with probabilities
proportional to $z_i$. The subsampling fractions are assumed chosen so
that $\hat R$ reduces to $\sum\sum y_{ij}/\sum\sum x_{ij}$. From section 11.14 this requires
$m_i = kM_i/z_i$, where we have used k in place of the previous $f_0/n$.

As in section 11.15, the cost function is

$$C = c_un + c_2\sum^n m_i + c_l\sum^n M_i$$

This cost function applies only if good preliminary estimates of the sizes
of all units in the population are available, since listing costs are included
only for those units that appear in the sample. If the whole population
has to be listed in advance, pps sampling is seldom economical for a
single survey unless listing is extremely cheap.

Since

$$E\left(\sum^n m_i\right) = \sum^N nz_im_i = nk\sum^N M_i = nkM_0$$

$$E\left(\sum^n M_i\right) = \sum^N nz_iM_i$$

the average cost of sampling n units is

$$C = c_un + c_2nkM_0 + c_ln\sum^N z_iM_i$$

In attempting to minimize $V(\hat R)$ for fixed average cost, the variables at
our disposal are n, k, and the probabilities $z_i$.
By (11.38) in section 11.14, the variance to be minimized is

$$V(\hat R) = \frac{1}{nX^2}\sum^N\left[\frac{1}{z_i}(Y_i - RX_i)^2 + \frac{M_i(M_i - m_i)}{z_im_i}S_{d2i}^2\right]$$

Since $d_{ij} = y_{ij} - Rx_{ij}$, we may write $(Y_i - RX_i) = M_i\bar D_i$. Noting
further that $M_i/z_im_i = 1/k$, we have

$$V(\hat R) = \frac{1}{nX^2}\sum^N\left(\frac{M_i^2\bar D_i^2}{z_i} + \frac{M_iS_{d2i}^2}{k} - \frac{M_iS_{d2i}^2}{z_i}\right)$$

Combining the first and third terms inside the parentheses gives

$$V = X^2V(\hat R) = \sum^N\left[\frac{M_i^2}{nz_i}\left(\bar D_i^2 - \frac{S_{d2i}^2}{M_i}\right) + \frac{M_iS_{d2i}^2}{nk}\right]$$

Finally, note that n appears only in the combinations $nz_i$ and $nk$.
Introduce the variables $z_i' = nz_i$ and $k' = nk$. Thus

$$V = \sum^N\left[\frac{M_i^2}{z_i'}\left(\bar D_i^2 - \frac{S_{d2i}^2}{M_i}\right) + \frac{M_iS_{d2i}^2}{k'}\right] \tag{11.44}$$

The problem is to minimize V with respect to variations in n, $k'$, and the
$z_i'$, subject to the restrictions that average cost is fixed and that

$$\sum^N z_i = 1, \quad \text{i.e.,} \quad \sum^N z_i' = n$$

Taking $\lambda$ and $\mu$ as undetermined multipliers, we minimize

$$V + \lambda\left(c_un + c_2k'M_0 + c_l\sum^N z_i'M_i - C\right) + \mu\left(\sum^N z_i' - n\right) \tag{11.45}$$

Differentiation gives

$$n:\quad \lambda c_u - \mu = 0$$

$$z_i':\quad -\frac{M_i^2}{z_i'^2}\left(\bar D_i^2 - \frac{S_{d2i}^2}{M_i}\right) + \lambda c_lM_i + \mu = 0$$

that is,

$$z_i'^2 = \frac{M_i^2\bar D_i'^2}{\lambda(c_u + c_lM_i)}$$

Since $z_i = z_i'/n$ and $\sum z_i = 1$, it follows that

$$z_i = \frac{M_i\bar D_i'/\sqrt{c_u + c_lM_i}}{\sum^N M_i\bar D_i'/\sqrt{c_u + c_lM_i}} \tag{11.46}$$

where

$$\bar D_i'^2 = \bar D_i^2 - \frac{S_{d2i}^2}{M_i}$$

and it has been assumed that the $\bar D_i'^2$ are positive. Equation (11.46) gives
the optimum selection probabilities.

The quantity $\bar D_i'^2$ must now be examined, since it may depend on the
size of unit $M_i$. In section 9.4 the variance between cluster unit means
was expressed in terms of the intraunit correlation coefficient. From
equation 9.7, with n = 1 and N assumed large, the variance among the
means of a group of primary units is given approximately by

$$V(\bar y) = \frac{S^2}{M}\left[1 + (M-1)\rho_M\right] \tag{11.47}$$

where $S^2$ is the variance among subunits in the population and M is the
average size of the primary units. The intraunit correlation has been
denoted by $\rho_M$ as a reminder that the correlation may depend on the size
of the unit.

Apply this result to an analysis of variance of the variate $d_{ij}$ into the
categories between units and within units. The symbol $S_d^2$ denotes the
variance among all subunits in the population. Assuming N large and
$M_i = M$, we have

Total SS: $NMS_d^2$

SS between units: $M\sum^N \bar D_i^2 = NS_d^2[1 + (M-1)\rho_M]$    (11.48)

where we have used (11.47) and noted that $\bar{\bar D} = 0$ by the definition of R.
Hence, by subtraction,

SS within units: $N(M-1)S_d^2(1-\rho_M) = N(M-1)S_{d2}^2$

where $S_{d2}^2$ is the variance within units. This gives

$$S_{d2}^2 = S_d^2(1 - \rho_M) \tag{11.49}$$

From (11.48) and (11.49) we obtain the average value of $\bar D_i'^2$ for primary
units of size M, that is,

$$E(\bar D_i'^2) = \frac1N\sum^N\left(\bar D_i^2 - \frac{S_{d2i}^2}{M}\right) = \frac{S_d^2}{M}\left[1 + (M-1)\rho_M - (1-\rho_M)\right] = \rho_MS_d^2$$

If M does not vary greatly, the assumption that $\rho_M$ is constant, hence
that $E(\bar D_i'^2)$ is constant, is often satisfactory. In general, however, $\rho_M$
may be expected to decrease as M increases, since subunits that are far
apart are less subject to common influences. As Hansen and Hurwitz
(1949) suggest, the rate of decrease is usually small enough so that $M\rho_M$
increases, hence $ME(\bar D_i'^2)$ increases, as M increases. If $\rho_M$ is zero or
negative, many of the quantities $\bar D_i'^2$ will be negative, and the solution
given here breaks down. In this situation two-stage sampling is less
efficient than one-stage.

We can now discuss the optimum choice of the $z_i$. From (11.46),

$$z_i \propto \frac{M_i\bar D_i'}{\sqrt{c_u + c_lM_i}}$$

Since the values of the individual $\bar D_i'$ are not known, we replace $\bar D_i'$ by
its average for units of size $M_i$, that is, by $\sqrt{E(\bar D_i'^2 \mid M_i)} = \bar D_{M_i}'$ (say).
The following deductions may be made.

1. Suppose that $c_lM_i$, the cost of listing per primary unit, is small
relative to $c_u$, the fixed cost per primary unit. If $\bar D_{M_i}'$ is constant, then
$z_i \propto M_i$, so that pps selection is best. If $\bar D_{M_i}'$ decreases with increasing
$M_i$, optimum probabilities lie between $z_i \propto M_i$ and $z_i \propto \sqrt{M_i}$.

2. If the cost of listing predominates, optimum probabilities lie between
$z_i \propto \sqrt{M_i}$ and $z_i$ = constant (equal probabilities).

3. If costs of listing and fixed costs are of the same order of magnitude,
$z_i \propto \sqrt{M_i}$ is a good compromise.
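The deductions above can be checked by evaluating (11.46) for limiting cost structures. The sketch below is illustrative only: the function name is hypothetical, and the values supplied for $\bar D_{M_i}'$ are assumed constants rather than estimates from data.

```python
import math


def optimum_probabilities(M, D_prime, c_u, c_l):
    """Selection probabilities from (11.46): z_i proportional to
    M_i * D'_i / sqrt(c_u + c_l * M_i)."""
    w = [Mi * d / math.sqrt(c_u + c_l * Mi) for Mi, d in zip(M, D_prime)]
    total = sum(w)
    return [wi / total for wi in w]


# Listing cost negligible, constant D': probabilities proportional to M_i (pps).
z_pps = optimum_probabilities([10, 30, 60], [1.0, 1.0, 1.0], c_u=1.0, c_l=0.0)
# Listing cost dominant, constant D': probabilities proportional to sqrt(M_i).
z_sqrt = optimum_probabilities([1, 4, 9], [1.0, 1.0, 1.0], c_u=0.0, c_l=1.0)
print([round(v, 3) for v in z_pps], [round(v, 3) for v in z_sqrt])
# [0.1, 0.3, 0.6] [0.167, 0.333, 0.5]
```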


The optimum k is found by differentiating (11.45) with respect to $k'$.
The result is

$$k = \frac{\sqrt{\sum^N M_iS_{d2i}^2}}{\sqrt{c_2M_0}\,\sum^N M_i\bar D_i'/\sqrt{c_u + c_lM_i}} \tag{11.50}$$

This result is similar to that obtained in section 11.15 for units chosen
with equal probabilities. To see this, note that from (11.46) and (11.50)
the optimum $m_i = kM_i/z_i$ is found to be

$$m_i = \frac{\sqrt{\sum^N M_iS_{d2i}^2/M_0}}{\bar D_i'}\sqrt{\frac{c_u + c_lM_i}{c_2}}$$

In this form the result is the same as that for $\bar m_{opt}$ in (11.43), noting that
$c_1$ in (11.43) is $c_u + c_l\bar M$.

Finally, the optimum n is found by solving the average cost equation.

11.17 STRATIFIED SAMPLING. UNBIASED ESTIMATES


For the unbiased methods the extension to stratified sampling is
straightforward. The subscript h denotes the stratum.

$M_{0h}$ = total number of subunits in stratum h

$M_0 = \sum_h^L M_{0h}$ = total number of subunits in the population

The estimated population mean per subunit is

$$\hat{\bar{\bar Y}}_{st} = \sum_h^L W_h\hat{\bar{\bar Y}}_h, \qquad W_h = \frac{M_{0h}}{M_0}$$

where $\hat{\bar{\bar Y}}_h$ denotes the estimate of the stratum mean per subunit $\bar{\bar Y}_h$. Further,

$$V(\hat{\bar{\bar Y}}_{st}) = \sum_h^L W_h^2V(\hat{\bar{\bar Y}}_h), \qquad v(\hat{\bar{\bar Y}}_{st}) = \sum_h^L W_h^2v(\hat{\bar{\bar Y}}_h)$$

These variances are readily obtained from the formulas already given.
It is of interest to consider the conditions under which the estimates
become self-weighting. For ppes sampling (section 11.8) the estimated
population mean is, from equation (11.18),

$$\hat{\bar{\bar Y}}_{st} = \frac{1}{M_0}\sum_h^L \frac{1}{n_h}\sum_i^{n_h} \frac{M_{hi}y_{hi}}{z_{hi}m_{hi}}$$

where $y_{hi}$ is the total over the $m_{hi}$ subunits taken from the $i$th unit in
stratum h. In section 11.8 we saw that the estimate is self-weighting
within strata if $M_{hi}/m_{hi}z_{hi} = n_h/f_{0h}$. With this substitution, the estimate
becomes

$$\hat{\bar{\bar Y}}_{st} = \frac{1}{M_0}\sum_h^L \frac{1}{f_{0h}}\sum_i^{n_h} y_{hi}$$

Thus the estimate is completely self-weighting if $f_{0h}$, the expected over-all
sampling fraction, is the same in all strata.

If units are of the same size within any stratum (i.e., $M_{hi} = M_h$), it was
shown in section 10.10 that sample allocation leading to a completely
self-weighting estimate is close to the optimum allocation, provided that
$S_{2h}/\sqrt{c_{2h}}$ is reasonably constant. A similar result holds here. From
(11.23), we have

$$V(\hat{\bar{\bar Y}}_{st}) = \frac{1}{M_0^2}\sum_h^L \frac{1}{n_h}\left[\sum_i^{N_h} z_{hi}\left(\frac{Y_{hi}}{z_{hi}} - Y_h\right)^2 + \sum_i^{N_h} \frac{M_{hi}^2S_{2hi}^2}{z_{hi}m_{hi}}\left(1 - \frac{m_{hi}}{M_{hi}}\right)\right]$$

where $M_{hi}/z_{hi}m_{hi} = n_h/f_{0h}$ to make the estimate self-weighting within each
stratum. The quantity $f_{0h}$ enters this expression in the form

$$\frac{1}{M_0^2}\sum_h^L \frac{1}{f_{0h}}\sum_i^{N_h} M_{hi}S_{2hi}^2 \tag{11.51}$$

(The term in $m_{hi}/M_{hi}$, arising from the second-stage fpc, may be written
as a term in $M_{hi}S_{2hi}^2/z_{hi}n_h$ and thus is a term in $1/n_h$ rather than in $1/f_{0h}$.)
With a simple cost function, the expected cost may be written

$$C = \sum_h c_{1h}n_h + \sum_h c_{2h}f_{0h}M_{0h} \tag{11.52}$$

since $f_{0h}M_{0h}$ is the expected number of second-stage units to be drawn
from stratum h. In this cost function listing costs have been included in
$c_{1h}$.

Hence, from (11.51) and (11.52), the variance is minimized for fixed
cost if

$$f_{0h} \propto \frac{\sqrt{\sum_i^{N_h} M_{hi}S_{2hi}^2}}{\sqrt{c_{2h}M_{0h}}} = \frac{\sqrt{\sum_i^{N_h} (M_{hi}/M_{0h})S_{2hi}^2}}{\sqrt{c_{2h}}}$$

The term involving variances is a weighted mean of the $S_{2hi}^2$. This verifies
the result.

The estimated variance is obtained from (11.25) in section 11.8. For a
completely self-weighting estimate, (11.25') shows that v takes the form

$$v(\hat{\bar{\bar Y}}_{st}) = \frac{1}{f_0^2M_0^2}\sum_h^L \frac{n_h}{n_h - 1}\sum_i^{n_h} (y_{hi} - \bar y_h)^2$$

11.18 STRATIFIED SAMPLING. RATIO ESTIMATES


With ratio estimates the familiar problem, whether to take a separate
or a combined estimate, arises. The separate estimate is preferable if
$n_h$ is large in each stratum and the true ratio is likely to vary from stratum
to stratum. Its variance formulas follow at once from those for a single
stratum.

For the combined estimate, with ppes sampling, let

$$\hat Y_h = \frac{1}{n_h}\sum_{i=1}^{n_h} \frac{M_{hi}\bar y_{hi}}{z_{hi}}, \qquad \hat X_h = \frac{1}{n_h}\sum_{i=1}^{n_h} \frac{M_{hi}\bar x_{hi}}{z_{hi}}$$

The quantities $\hat Y_h$, $\hat X_h$ are unbiased estimates of the stratum totals $Y_h$, $X_h$.
The combined ratio estimate is defined as

$$\hat R_c = \frac{\sum_h^L \hat Y_h}{\sum_h^L \hat X_h}$$

The approximate MSE of $\hat R_c$ is found as usual by writing

$$\hat R_c - R \doteq \frac1X\sum_h^L (\hat Y_h - R\hat X_h)$$

The quantity $\hat Y_h - R\hat X_h$ is an unbiased ppes estimate of the stratum total
$Y_h - RX_h$ of the variate $d_{hij} = y_{hij} - Rx_{hij}$. Hence, from formula (11.23)
in section 11.8, replacing $y_{hij}$ by $d_{hij}$, we obtain

$$\text{MSE}(\hat R_c) = \frac{1}{X^2}\sum_h^L \frac{1}{n_h}\sum_i^{N_h}\left[z_{hi}\left(\frac{D_{hi}}{z_{hi}} - D_h\right)^2 + \frac{M_{hi}^2(1-f_{2hi})S_{d2hi}^2}{z_{hi}m_{hi}}\right]$$

where

$$D_{hi} = Y_{hi} - RX_{hi}, \qquad D_h = Y_h - RX_h$$

$$S_{d2hi}^2 = \frac{1}{M_{hi}-1}\sum_j^{M_{hi}}\left[(y_{hij} - Rx_{hij}) - (\bar Y_{hi} - R\bar X_{hi})\right]^2$$

Similarly, the estimated variance of $\hat R_c$ is obtained from (11.25) as

$$v(\hat R_c) = \frac{1}{X^2}\sum_h^L \frac{1}{n_h(n_h-1)}\sum_i^{n_h} (d_{hi}' - \bar d_h')^2$$

where

$$d_{hi}' = \frac{M_{hi}\bar d_{hi}}{z_{hi}}, \qquad \bar d_h' = \frac{1}{n_h}\sum_i^{n_h} d_{hi}', \qquad \bar d_{hi} = \bar y_{hi} - \hat R_c\bar x_{hi}$$

If X is not known, the sample estimate $\sum\hat X_h$ is substituted for it.

The estimate $\hat R_c$ reduces to the ratio of the sample totals of $y_{hij}$ and
$x_{hij}$ if the over-all sampling fraction is the same in all strata.
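The combined ratio estimate and its estimated variance can be computed stratum by stratum. The following is a sketch under stated assumptions: the function name, the tuple layout for a sampled unit, and the two-stratum data are all hypothetical; X defaults to the sample estimate $\sum\hat X_h$ when it is not supplied, as the text allows.

```python
def combined_ratio(strata, X=None):
    """Combined ratio estimate R_c and its estimated variance.

    `strata` is a list of strata; each stratum is a list of sampled units
    given as tuples (M_hi, z_hi, ybar_hi, xbar_hi), drawn with replacement.
    """
    Y_hat, X_hat = [], []
    for units in strata:
        n_h = len(units)
        Y_hat.append(sum(M * yb / z for M, z, yb, xb in units) / n_h)
        X_hat.append(sum(M * xb / z for M, z, yb, xb in units) / n_h)
    r_c = sum(Y_hat) / sum(X_hat)
    X_total = sum(X_hat) if X is None else X
    v = 0.0
    for units in strata:
        n_h = len(units)
        d = [M * (yb - r_c * xb) / z for M, z, yb, xb in units]   # d'_hi
        d_bar = sum(d) / n_h
        v += sum((di - d_bar) ** 2 for di in d) / (n_h * (n_h - 1))
    return r_c, v / X_total**2


# Hypothetical data: two strata, two sampled units each, y exactly half of x.
strata = [
    [(100, 0.25, 2.0, 4.0), (200, 0.50, 3.0, 6.0)],
    [(50, 0.50, 1.0, 2.0), (150, 0.50, 5.0, 10.0)],
]
r_c, v_c = combined_ratio(strata)
print(round(r_c, 6), round(v_c, 6))  # 0.5 0.0
```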


11.19 SELECTION WITH UNEQUAL PROBABILITIES 
WITHOUT REPLACEMENT 


In some surveys the strata contain relatively few primary units, say 
5 to 15, of which 2 or 3 are selected in the sample. Thus the primary 
sampling fraction may lie between 10 and 50%. This situation has 
encouraged the search for a satisfactory method of selecting primary units 
without replacement, which might produce the reduction in variance 
associated with a finite population correction. Some of the principal 
methods were described for one-stage sampling in sections 9.14 and 9.15. 
In the numerical example in Table 9.10 (p. 266), with two primary units 
drawn out of five, it is noticeable that in population A, in which Y, was 
uncorrelated with z;, the three estimates Py, Psys, and Ya, gave reduc- 
tions in variance of about 40% as compared with Y,,,, in which selection 
was with replacement. (These were the estimates obtained by plans which
did not distort the probabilities of selection.)

In two-stage sampling the gains in precision from selection without
replacement will be smaller. The sampling variance formulas show that
this gain affects only the between primary unit contribution to the variance.
Although the within primary unit variances are not the same for sampling
with and without replacement, they are of the same order of magnitude.
Moreover, when the primary sampling fraction is large, the second-stage
fraction is likely to be small, so that the second-stage variance is a major
part of the total variance. These considerations suggest that for most
purposes the need for methods of selection without replacement is not too [...]

[...]

$$\hat Y = \frac12\left(\frac{M_1\bar y_1}{z_1} + \frac{M_2\bar y_2}{z_2}\right) = \frac12(y_1' + y_2')$$

The estimate is biased, since this method distorts the selection probabilities,
but the bias appears to be unimportant. The quantity $(y_1' - y_2')^2/4$
should be an overestimate of the variance.

The three remaining methods have already been [...] to
arrange primary units in random order, drawing a systematic
sample from the cumulated $z_i$. Each primary unit in which a point of the
systematic sample falls is included in the sample (Hartley and Rao, 1962).
An unbiased estimate of Y is $\sum y_i'/n$. As with the Yates and Grundy
method, the estimate is self-weighting if $m_i/M_i = f_0/nz_i$, where $f_0$ is the
expected over-all sampling fraction.

In the third and fourth methods, the stratum is divided into n groups.
One primary unit is drawn from each group with probability proportional
to relative size within the group, that is, to $z_i/Z_g$, where $Z_g = \sum z_i$ taken
over the group (say, the $g$th) in which the $i$th unit falls. An unbiased
estimate of Y is

$$\hat Y_g = \sum_g^n \frac{Z_gM_i\bar y_i}{z_i} \tag{11.53}$$

In forming the groups, one way is to make $Z_g$ constant, as far as possible,
in order to keep the probabilities of selection proportional to the original
$z_i$. It helps also if the group means $\bar Y_g$ are approximately equal, since the
estimate of variance uses the method of collapsed strata.

The fourth method is to assign units to groups at random, with the
number of units in a group as nearly equal as possible (Rao, Hartley, and
Cochran, 1962). The estimate $\hat Y_g$ is as in (11.53). An estimate of variance
that is unbiased for any n is

$$v(\hat Y_g) = \frac{\sum N_g^2 - N}{N^2 - \sum N_g^2}\left[\sum^n Z_gy_i'^2 - \hat Y_g^2\right] + \sum^n \frac{Z_g}{z_i}\,\frac{M_i^2(1-f_{2i})s_{2i}^2}{m_i}$$

where $N_g$ is the number of units in the $g$th group, $y_i' = M_i\bar y_i/z_i$, and
$N = \sum N_g$. As usual, $s_{2i}^2 = \sum_j (y_{ij} - \bar y_i)^2/(m_i - 1)$. If N is divisible by
n, so that $N_g = N/n$, the term outside the square bracket becomes
$(1 - n/N)/(n - 1)$.
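The closing remark about equal group sizes can be verified with exact arithmetic, and the estimate (11.53) is a one-line sum. The sketch below uses Python's `fractions` module; the function names and the sample tuples are hypothetical illustrations.

```python
from fractions import Fraction


def rhc_factor(group_sizes):
    """Multiplier (sum N_g^2 - N) / (N^2 - sum N_g^2) outside the square bracket."""
    N = sum(group_sizes)
    ss = sum(g * g for g in group_sizes)
    return Fraction(ss - N, N * N - ss)


def rhc_estimate(sample):
    """Estimate (11.53) of the total Y; `sample` holds one tuple
    (Z_g, z_i, M_i, ybar_i) for the unit drawn from each group."""
    return sum(Zg * Mi * yb / zi for Zg, zi, Mi, yb in sample)


# With N = 12 units in n = 3 equal groups the multiplier is (1 - n/N)/(n - 1).
N, n = 12, 3
factor = rhc_factor([N // n] * n)
print(factor, factor == (1 - Fraction(n, N)) / (n - 1))  # 3/8 True

# Hypothetical sample: groups holding a single unit each (Z_g = z_i),
# so the estimate is simply the sum of M_i * ybar_i.
total = rhc_estimate([(0.2, 0.2, 10, 3.0), (0.8, 0.8, 40, 2.0)])
```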

11.20 SUMMARY COMMENTS 


The efficient design of a multistage sample with primary units of unequal 
size requires a good deal of preliminary work. Selection of primary units 
with probabilities proportional to a measure of size $z_i$ is at its most
effective, relative to selection with equal probabilities, when the ratios
$Y_i/z_i$ are uncorrelated with the sizes $z_i$ for the principal items in the survey
and the sizes vary substantially. These conditions hold frequently in the 
sampling of records in which the sizes of the primary units (groups of 
records) are determined by administrative or economic considerations, 
the data in individual records being at about the same level in different 
units. The principal decisions to be made are the following: 


1. Find out whether the sizes are known, known approximately, or
unknown. In the last case consider whether some information about
sizes can be obtained relatively easily. For example, Jessen et al. (1947)
conducted two-stage samples of blocks in some Greek towns in which no 
usable estimates of the numbers of households per block were available. 
They considered three approaches: (a) Drawing the blocks with equal 
probabilities. (b) Making a rapid tour of the town by jeep in order to tie 
together small blocks to build artificial blocks that appeared to have 
roughly the same numbers of households. Blocks which obviously had 
no households were eliminated in this process. The sample blocks were 
then chosen with equal probability. (c) Cruising the town slowly enough 
to permit estimates to be made of the number of households in each block. 
Blocks were then chosen with probability proportional to estimated sizes. 

2. Consider whether to use size of unit as one of the variables for 
stratification: this is advisable unless it prevents the use of some other 
variable that might give a worthwhile increase in precision. 

3. Decide how the units are to be selected within strata. If sizes are 
known at least approximately, selection with pps, or its square root, will 
often be the best procedure, although this depends on the nature of the 
field costs. 

4. Select a method of estimation. For estimating the population mean 
or total, a ratio estimate using the value of the same item at a recent census 
is sometimes very successful, if available. Estimates based on the sample 
mean or weighted sample mean are often more precise than the unbiased 
estimates. 

5. Decide on the sampling and subsampling fractions within strata. 
We have recommended that subsampling fractions be chosen so that the 
estimates are self-weighting within strata. Further control so that the 
sample is completely self-weighting is advisable unless it appears to be 
accompanied by a substantial loss of precision. 
Descriptions of the planning and conduct of surveys involving two-stage 
sampling with units of unequal sizes are contained in the following references. 
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More Extensive Populations 


Gray, P. G. and Corlett, T. (1950). Sampling for the social survey. 
Jour. Roy. Stat. Soc., A113, 150-206. 


Hemphill, F. M. (1952). A sample survey of home injuries. Public 
Health Reports, 67. 


Peaker, G. F. (1953). A sampling design used by the Ministry of Education. 
Jour. Roy. Stat. Soc., A116, 140-165. (A survey of the reading 
abilities of children aged 15.) 


See also the books by Hansen, Hurwitz, and Madow and by Yates. 


EXERCISES 


11.1 By working out the estimates for all possible samples which can be 
drawn from the artificial population in Table 11.1, by methods Ia, Ib, II, and III, 
verify the total MSE's given in Table 11.2. 

11.2 For methods II (equal probabilities, unbiased estimate) and III (pps 
selection), recompute the variances of the estimates for the example in Table 11.1 when 
$m_i = 1$. Show that the precision of method III in relation to method II is lower 
for $m_i = 1$ than for $m_i = 2$. What general result does this illustrate? 

11.3 For the population in Table 11.1, if the estimated sizes $z_i$ are 0.1, 0.3 
and 0.6, with $m_i = 2$, show that the unbiased estimate (method IV) gives a 
smaller variance than pps sampling. What is the explanation of this result? 

11.4 The elements in a population with three primary units are classified into 
two classes. The unit sizes $M_i$ and the proportions $P_i$ of elements which belong 
to the first class are as follows: 


$M_1 = 100$, $M_2 = 200$, $M_3 = 300$; $P_1 = 0.40$, $P_2 = 0.45$, $P_3 = 0.35$ 


For a sample consisting of 50 elements from one primary unit, compare the 
MSE's of methods Ia, II, and III for estimating the proportion of elements in 
the first class in the population. (In the variance formulas in section 11.2, $S_i^2$ 
is approximately $P_i Q_i$.) 

11.5 A sample of $n$ primary units is selected with equal probabilities. From 
each chosen unit, a constant fraction $f_2$ of the subunits is taken. If $a_i$ out of the 
$m_i$ subunits in the $i$th unit fall in class C, show that the ratio-to-size estimate 
(section 11.6) of the population proportion in class C is $p = \sum a_i/\sum m_i$. From 
formula (11.12), show that an estimate of MSE($p$) is 

$$v(p) = \frac{1 - f_1}{n\bar{M}^2}\,\frac{\sum M_i^2 (p_i - p)^2}{n - 1} + \frac{f_1(1 - f_2)}{n^2\bar{M}^2}\sum \frac{M_i^2 p_i q_i}{m_i - 1}$$

where $p_i = a_i/m_i$. 


11.6 A firm with 36 factories decides to check the condition of some equipment 
of which $M_0$ = 25,012 pieces are in use. A random sample of 12 factories is 
taken, [...] pieces being checked in each selected factory. The numbers 
of pieces checked ($m_i$) and the numbers found with signs of deterioration ($a_i$) 
are as follows. 


Factory   m_i    a_i   p_i = a_i/m_i      Factory   m_i   a_i   p_i = a_i/m_i 

   1      [...]    8     0.12                7       85    18     0.212 
   2       82     21     0.256               8       73    11     0.151 
   3       52      4     0.077               9       50     7     0.140 
   4       91     12     0.132              10       76     9     0.118 
   5       62      1     0.016              11       64    20     0.312 
   6       69      3     0.043              12       50     2     0.040 


Estimate the percentage and the total number of defective pieces in use and 
give estimates of their standard errors. 

Note. Since $M_i/\bar{M} = m_i/\bar{m}$, the between-units component of $v(p)$ may be 
computed as 

$$\frac{1 - f_1}{n\bar{m}^2(n - 1)}\left(\sum a_i^2 - 2p\sum a_i m_i + p^2\sum m_i^2\right),$$

and since the $m_i$ are fairly large, the within-units component as 

$$\frac{f_1(1 - f_2)}{(n\bar{m})^2}\sum a_i q_i$$


11.7 If primary units are selected with equal probabilities and $f_2$ is constant, 
show that in the notation of exercise 11.5 the unbiased estimate of a population 
proportion is $p = N\sum a_i/nM_0 f_2$ and that, if terms in $1/m_i$ are negligible, its 
variance may be computed as 

$$v(p) = \frac{(1 - f_1)\sum (a_i - \bar{a})^2}{n(n - 1)\bar{m}^2} + \frac{f_1(1 - f_2)\sum a_i q_i}{(n\bar{m})^2}$$

Compute $p$ and its standard error for the data in exercise 11.6. 
to esti a sample of n primary units is chosen with probabilities proportional 
sam MESA sizes z; (with replacement) and with a constant expected over-all 
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and TMo| 2 mo Where Tis the sample 
the unbiased estimate can be used, but 
population mean per subunit, the 
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yi ion is reversed.) 
In a study of overcrowding in a large city one stratum contained 100 
blocks of which 10 were chosen with probabilities proportional to estimated 
size (with replacement). An expected over-all sampling fraction $f_0$ = 2% was 
used. The sample totals in each block for number of rooms and number of 
persons are as follows. 
lock 
8 9 10 
peek 1 2 3 4 5 6 7 
71 58 
À AIE Bs 
sj 58 56 130.93 109. 99 


Per: 
Sons ils: 80. 329093) Awe 109 
(a) Estimate the total number of rooms and the total number of persons in the 
stratum and the average number of persons per room. (b) Compute standard 
errors for the total number of persons and for the average number of persons per 
room. (Use formulas 11.25' and 11.39, noting that $m_i z_i/M_i = f_0/n$.) 
CHAPTER 12 


Double Sampling 


12.1 DESCRIPTION OF THE TECHNIQUE 


As we have seen, a number of sampling techniques depend on the 
possession of advance information about an auxiliary variate $x_i$. Ratio 
and regression estimates require a knowledge of the population mean $\bar{X}$. 
If it is desired to stratify the population according to the values of the 
$x_i$, their frequency distribution must be known. 

When such information is lacking, it is sometimes relatively cheap to 
take a large preliminary sample in which $x_i$ alone is measured. The 
purpose of this sample is to furnish a good estimate of $\bar{X}$ or of the frequency 
distribution of $x_i$. In a survey whose function is to make estimates for 
some other variate $y_i$, it may pay to devote part of the resources to this 
preliminary sample, although this means that the size of the sample in 
the main survey on $y_i$ must be decreased. This technique is known as 
double sampling or two-phase sampling. As the discussion implies, the 
technique is profitable only if the gain in precision from ratio or regression 
estimates or stratification more than offsets the loss in precision due to 
the reduction in the size of the main sample. 

Double sampling may be appropriate when the information about 
$x_i$ is on file cards that have not been tabulated. For instance, in surveys 
of the German civilian population in 1945 the sample from any town was 
usually drawn from rationing registration lists. In addition to geographic 
stratification within the town, for which data were usually already available, 
stratification by sex and age was proposed. Since the sample had 
to be drawn in a hurry, and since the lists were in constant use, tabulation 
of the complete age and sex distribution was not feasible. A moderately 
large systematic sample could, however, be drawn quickly. Each person 
drawn was classified into the appropriate age-sex class. From these data 
the much smaller list of persons to be interviewed was selected. 


12.2 DOUBLE SAMPLING FOR STRATIFICATION 


The theory was first given by Neyman (1938). 
The population is to be stratified into a number of classes according to 
the values of $x_i$. The first sample is a simple random sample of size $n'$. 
Let 


$W_h = N_h/N$ = proportion of population falling into stratum $h$ 
$w_h = n_h'/n'$ = proportion of first sample falling into stratum $h$ 

Then $w_h$ is an estimate of $W_h$. 

The second sample is a stratified random sample of size $n$ in which $y_i$ is 
measured: $n_h$ units are drawn from stratum $h$. The second sample is often 
a subsample from the first sample, but it may be drawn independently 
if this is more convenient. 


The cost of the two samples is assumed to be 

$$C = nc_n + n'c_{n'} \tag{12.1}$$

where $c_n$ is usually large in relation to $c_{n'}$. 

The problem is to choose $n'$ and the $n_h$ (and consequently $n$) to minimize 
the variance of the estimate for a given cost. We must then verify whether 
the minimum variance is smaller than can be attained by a single simple 
random sample in which $y_i$ alone is measured. 


The first step is to set up the estimate and determine its variance. The 
population mean is 

$$\bar{Y} = \sum_{h=1}^{L} W_h \bar{Y}_h$$

As an estimate we use 

$$\bar{y}_{st} = \sum_{h=1}^{L} w_h \bar{y}_h$$

Whenever a new sample is drawn, this implies a fresh drawing of both 
the first and the second samples. Thus the $w_h$ and the sample means $\bar{y}_h$ are 
both random variables, subject to error. The problem is therefore one 
of stratification in which the strata totals are not known exactly. The 
strata boundaries are assumed fixed in repeated sampling. 

In the theorems for the mean and variance of $\bar{y}_{st}$ a slight approximation 
is involved. It is assumed that every $w_h > 0$, that is, $n'$ is assumed large 
enough so that the probability that any stratum contains no units in the 
large sample is negligible. 

Theorem 12.1. The estimate $\bar{y}_{st}$ is unbiased. 

Proof. Average first over samples in which the $w_h$ are fixed. Since 
$\bar{y}_h$ is the mean of a simple random sample from the stratum, $E(\bar{y}_h) = \bar{Y}_h$; 
but, when the average is taken over different selections of the first sample, 
$E(w_h) = W_h$, since the first sample is also a simple random sample. Hence 

$$E(\bar{y}_{st}) = E\left[E\left(\sum w_h \bar{y}_h \mid w_h\right)\right] = E\left(\sum w_h \bar{Y}_h\right) = \sum W_h \bar{Y}_h = \bar{Y}$$

Theorem 12.2. If the values of the $n_h$ do not depend on the $w_h$, 

$$V(\bar{y}_{st}) = \sum_{h=1}^{L}\left[W_h^2 + \frac{g'W_h(1 - W_h)}{n'}\right]\frac{S_h^2}{n_h}(1 - f_h) + \frac{g'}{n'}\sum_{h=1}^{L} W_h(\bar{Y}_h - \bar{Y})^2 \tag{12.2}$$

where $g' = (N - n')/(N - 1)$ and $f_h = n_h/N_h$. 

Proof. Average first over samples in which the $w_h$ are fixed. Over 
these samples, the mean of $\bar{y}_{st}$ is $\sum w_h \bar{Y}_h$, so that there is a bias of amount 
$\sum (w_h - W_h)\bar{Y}_h$. The conditional variance of $\bar{y}_{st}$ is given by theorem 5.3. 
Hence the mean square error is 

$$E\left[(\bar{y}_{st} - \bar{Y})^2 \mid w_h\right] = \sum_{h=1}^{L}\frac{w_h^2(1 - f_h)S_h^2}{n_h} + \left[\sum_{h=1}^{L}(w_h - W_h)\bar{Y}_h\right]^2 \tag{12.3}$$

Now average over selections of the $w_h$. (At this point we use the assumption 
that the $n_h$ remain constant when the $w_h$ vary.) By theorem 3.2, 

$$V(w_h) = \frac{g'W_h(1 - W_h)}{n'}$$

so that 

$$E(w_h^2) = [E(w_h)]^2 + V(w_h) = W_h^2 + \frac{g'W_h(1 - W_h)}{n'} \tag{12.4}$$

Also, it is easily shown that 

$$E(w_h - W_h)(w_j - W_j) = -\frac{g'W_hW_j}{n'} \qquad (j \ne h)$$

For the last term in (12.3), these results give 

$$E\left[\sum_{h=1}^{L}(w_h - W_h)\bar{Y}_h\right]^2 = \frac{g'}{n'}\left[\sum_{h=1}^{L} W_h(1 - W_h)\bar{Y}_h^2 - 2\sum_{h=1}^{L}\sum_{j>h} W_hW_j\bar{Y}_h\bar{Y}_j\right]$$

$$= \frac{g'}{n'}\left[\sum W_h\bar{Y}_h^2 - \bar{Y}^2\right] = \frac{g'}{n'}\sum W_h(\bar{Y}_h - \bar{Y})^2 \tag{12.5}$$

Finally, substituting from (12.4) and (12.5) into (12.3), we obtain 

$$V(\bar{y}_{st}) = \sum\left[W_h^2 + \frac{g'W_h(1 - W_h)}{n'}\right]\frac{S_h^2(1 - f_h)}{n_h} + \frac{g'}{n'}\sum W_h(\bar{Y}_h - \bar{Y})^2$$

The term free from $n'$ is the familiar expression for the variance when 
the strata sizes are known exactly. The effects of errors in the first sample 
are therefore to increase slightly the within-stratum contribution and to 
introduce a between-stratum component. 
In most surveys, interest centers on the current average (3), particularly 
if the characteristics of the population are likely to change rapidly with 
time. With a population in which time changes are slow, on the other 
hand, an annual average (2) taken over 12 monthly samples or four 
quarterly samples may be adequate for the major uses. This would be 
the situation in a study of the prevalence of chronic diseases of long 
duration. With a disease whose prevalence shows marked seasonal 
variation, the current data are of major interest, but annual averages are 
also useful for comparisons between different regions and different years. 
Estimates of change (1) are wanted mainly in attempts to study the effects 
of forces that are known to have acted on the population. For instance, 
if a bill is passed which is supposed to stimulate the building of houses, 
it is interesting to know whether the building rate of new houses has 
increased in the succeeding year (with a realization that an increase may 
not be entirely due to the bill). 

Suppose that we are free to alter or retain the composition of the sample 
and that the total size of sample is to be the same on all occasions. If we 
wish to maximize precision, the following statements can be made about 
replacement policy: 


1. For estimating change, it is best to retain the same sample throughout 
all occasions. 


2. For estimating the average over all occasions, it is best to draw a new 
sample on each occasion. 

3. For current estimates, equal precision is obtained either by keeping 
the same sample or by changing it on every occasion. Replacement of 
part of the sample on each occasion may be better than these alternatives. 

Statements 1 and 2 hold because there is nearly always a positive 
correlation between the measurements on the same unit on two successive 
occasions. The estimated change on a unit has variance $S_1^2 + S_2^2 - 2\rho S_1S_2$, 
where the subscripts refer to the occasions. If the change is estimated 
from two different units, the variance is $S_1^2 + S_2^2$. In estimating the 
over-all mean for the two occasions, the variance is $(S_1^2 + S_2^2 + 2\rho S_1S_2)/4$ 
if the same unit is retained and $(S_1^2 + S_2^2)/4$ if a new unit is chosen. 
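
As a small numerical check on statements 1 and 2, the sketch below evaluates the four per-unit variance expressions just given. The function names and the illustrative values ($S_1 = S_2 = 1$, $\rho = 0.8$) are assumptions made for the illustration and do not come from the text.

```python
def var_change(s1, s2, rho, same_unit):
    """Variance of the estimated change between two occasions, per unit."""
    if same_unit:
        return s1**2 + s2**2 - 2 * rho * s1 * s2
    return s1**2 + s2**2  # two different units: no covariance term


def var_average(s1, s2, rho, same_unit):
    """Variance of the over-all mean of the two occasions, per unit."""
    if same_unit:
        return (s1**2 + s2**2 + 2 * rho * s1 * s2) / 4
    return (s1**2 + s2**2) / 4


# Hypothetical illustration: equal variances and a positive correlation.
s1, s2, rho = 1.0, 1.0, 0.8
print("change,  same unit:", var_change(s1, s2, rho, True))
print("change,  new unit: ", var_change(s1, s2, rho, False))
print("average, same unit:", var_average(s1, s2, rho, True))
print("average, new unit: ", var_average(s1, s2, rho, False))
```

With a positive correlation the retained unit gives the smaller variance for change and the larger variance for the over-all mean, which is the content of statements 1 and 2.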

Statement 3, which is less obvious, is investigated in succeeding sections. 


12.10 SAMPLING ON TWO OCCASIONS 


Suppose that the samples are of the same size $n$ on both occasions and 
that the current estimates are of primary interest. Replacement policy 
has been examined by Jessen (1942). For simplicity, we assume that 
simple random sampling is used and that the population variance $S^2$ of 
$y_i$ is the same on both occasions. 

The mean of the first sample has variance $S^2/n$, there being no previous 
information to utilize. In selecting the second sample, $m$ of the units in 
the first sample are retained ($m$ for matched). The remaining $u$ units 
($u$ for unmatched) are discarded and replaced by a new selection. 

Notation. 

$\bar{y}_{hu}$ = mean of unmatched portion on occasion $h$ 

$\bar{y}_{hm}$ = mean of matched portion on occasion $h$ 

$\bar{y}_h$ = mean of whole sample on occasion $h$ 


The unmatched and matched portions of the second sample provide 
independent estimates $\bar{y}_{2u}'$, $\bar{y}_{2m}'$ of $\bar{Y}_2$, as shown in Table 12.1. In the 
matched portion we use a double-sampling regression estimate, where 
the "large" sample is the first sample and the auxiliary variate $x_i$ is the 
value of $y_i$ on the first occasion. The variance of $\bar{y}_{2m}'$ comes from (12.24), 
p. 336: note that our $m$ and $n$ correspond to $n$ and $n'$, respectively, in 
(12.24). 


TABLE 12.1 
ESTIMATES FROM THE UNMATCHED AND MATCHED PORTIONS 

              Estimate                                                   Variance 

Unmatched:    $\bar{y}_{2u}' = \bar{y}_{2u}$                                        $S^2/u = 1/W_{2u}$ 

Matched:      $\bar{y}_{2m}' = \bar{y}_{2m} + b(\bar{y}_1 - \bar{y}_{1m})$          $S^2(1 - \rho^2)/m + \rho^2S^2/n = 1/W_{2m}$ 

The best combined estimate of $\bar{Y}_2$ is found by weighting the two independent 
estimates inversely as their variances. If $W_{2u}$, $W_{2m}$ are the 
inverse variances, this estimate is 

$$\bar{y}_2' = \phi\,\bar{y}_{2u}' + (1 - \phi)\,\bar{y}_{2m}' \tag{12.34}$$

where 

$$\phi = \frac{W_{2u}}{W_{2u} + W_{2m}}$$

By least squares theory, the variance of $\bar{y}_2'$ is 

$$V(\bar{y}_2') = \frac{1}{W_{2u} + W_{2m}}$$

From Table 12.1, this works out after simplification as 

$$V(\bar{y}_2') = \frac{S^2(n - u\rho^2)}{n^2 - u^2\rho^2} \tag{12.35}$$
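
The simplification leading to (12.35) can be checked numerically. The sketch below builds the two inverse variances from the unmatched variance $S^2/u$ and the matched variance $S^2(1 - \rho^2)/m + \rho^2S^2/n$, combines them, and compares the result with the closed form. The function names and the values $n = 100$, $\rho = 0.8$, $S^2 = 1$ are assumptions chosen only for illustration.

```python
def combined_variance(n, u, rho, s2=1.0):
    """Variance of the inverse-variance-weighted estimate on occasion 2."""
    m = n - u  # number of matched units
    # Unmatched estimate has variance S^2 / u, so its inverse variance is u / S^2.
    w_2u = u / s2
    # Matched (regression) estimate has variance S^2 (1 - rho^2) / m + rho^2 S^2 / n.
    w_2m = 0.0 if m == 0 else 1.0 / (s2 * (1 - rho**2) / m + rho**2 * s2 / n)
    return 1.0 / (w_2u + w_2m)


def closed_form_variance(n, u, rho, s2=1.0):
    """Formula (12.35)."""
    return s2 * (n - u * rho**2) / (n**2 - u**2 * rho**2)


# Hypothetical illustration.
n, rho = 100, 0.8
for u in (0, 25, 50, 75, 100):
    direct = combined_variance(n, u, rho)
    formula = closed_form_variance(n, u, rho)
    print(f"u = {u:3d}  direct = {direct:.6f}  (12.35) = {formula:.6f}")
```

The two columns agree for every $u$, and both ends of the range ($u = 0$ and $u = n$) give $S^2/n$.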
In most applications $f_h = n_h/N_h$ will be negligible. Frequently, $n'/N$ is 
also small, so that $g'$ can be replaced by 1. Under these conditions 

$$V(\bar{y}_{st}) = \sum\left[W_h^2 + \frac{W_h(1 - W_h)}{n'}\right]\frac{S_h^2}{n_h} + \sum\frac{W_h(\bar{Y}_h - \bar{Y})^2}{n'} \tag{12.6}$$

Corollary 1. The result of theorem 12.2 changes slightly if $n_h$ depends 
on $w_h$. For instance, if proportional stratification is desired, the sampler 
may take $n_h = nw_h$ in the small sample, since the $w_h$ are the best available 
estimates of the $W_h$. More generally, we may have $n_h = n\lambda_hw_h/\sum\lambda_hw_h$. 
This substitution is made in (12.3) before finding the average over different 
selections of the $w_h$. After some algebraic manipulation the variance is 
found to be, assuming $n_h/N_h$ negligible, 

$$V(\bar{y}_{st}) = \sum\frac{W_hS_h^2}{n\lambda_h}\left[Q + \frac{g'(\lambda_h - Q)}{n'}\right] + \sum\frac{g'W_h(\bar{Y}_h - \bar{Y})^2}{n'}$$

where $Q = \sum\lambda_hW_h$. If $\lambda_h = 1$ (proportional stratification), then $Q = 1$ 
and we have 

$$V(\bar{y}_{st}) = \sum\left[\frac{W_hS_h^2}{n} + \frac{g'W_h(\bar{Y}_h - \bar{Y})^2}{n'}\right]$$

Corollary 2. If a proportion is being estimated in the second sample, 

$$S_h^2 = \frac{N_h}{N_h - 1}P_hQ_h \doteq P_hQ_h, \qquad \bar{Y}_h - \bar{Y} = (P_h - P)$$

Theorem 12.2 gives, with $n_h/N_h$ negligible, 

$$V(p_{st}) = \sum\left[W_h^2 + \frac{g'W_h(1 - W_h)}{n'}\right]\frac{P_hQ_h}{n_h} + \frac{g'}{n'}\sum W_h(P_h - P)^2 \tag{12.7}$$

where $P_h$ is the proportion in stratum $h$. 

Papers by Robson (1952) and Robson and King (1953) extend this 
theory to two-stage sampling, applying it to the estimation of magazine 
readership. 


12.3 OPTIMUM ALLOCATION 
The values of the $n_h$ and $n'$ that lead to the minimum variance are rather 
complicated. It follows from formula (12.2) and the cost function that, 
if $n'$ and $n$ are given, $n_h$ should be proportional to 

$$S_h\sqrt{W_h^2 + \frac{g'W_h(1 - W_h)}{n'}}$$

Since the second term inside the root is usually small compared with the 
first, Neyman (1938) suggests taking $n_h$ proportional to $W_hS_h$. Thus 

$$n_h = \frac{nW_hS_h}{\sum W_hS_h}$$

When these values are substituted into the variance (12.2), ignoring the 
term in $W_h(1 - W_h)$ and assuming $n_h/N_h$, $n'/N$ negligible, we obtain 

$$V(\bar{y}_{st}) = \frac{\left(\sum W_hS_h\right)^2}{n} + \frac{\sum W_h(\bar{Y}_h - \bar{Y})^2}{n'} \tag{12.8}$$

$$= \frac{V_n}{n} + \frac{V_{n'}}{n'} \quad \text{(say)} \tag{12.8'}$$

This approximate expression for the variance is now minimized by 
choice of $n$ and $n'$ for a given cost of 

$$C = nc_n + n'c_{n'} \tag{12.1}$$

It is easily found that 

$$\frac{n}{\sqrt{V_nc_{n'}}} = \frac{n'}{\sqrt{V_{n'}c_n}} \tag{12.9}$$

This equation and (12.1) determine $n$ and $n'$. 

An expression for the minimum variance is needed for later applications 
of double sampling. From (12.9), 

$$\frac{n}{\sqrt{V_nc_{n'}}} = \frac{n'}{\sqrt{V_{n'}c_n}} = \frac{nc_n + n'c_{n'}}{c_n\sqrt{V_nc_{n'}} + c_{n'}\sqrt{V_{n'}c_n}} = \frac{C}{\sqrt{c_nc_{n'}}\left(\sqrt{V_nc_n} + \sqrt{V_{n'}c_{n'}}\right)} \tag{12.9'}$$

Substitute these solutions in (12.8') for $V_{opt}$. This gives 

$$V_{opt} = \frac{1}{C}\left(\sqrt{V_nc_n} + \sqrt{V_{n'}c_{n'}}\right)^2 \tag{12.10}$$

If the first sample is very cheap, the assumption that $n'/N$ is negligible, 
that is, that $g' \doteq 1$, may not hold. The solutions (12.9) for $n$ and $n'$ remain 
satisfactory, the only change needed being to subtract the term $V_{n'}/N$ from 
(12.10). 

Example. This example is artificial, but it illustrates the calculations involved. 
Farm size is employed to divide the population into two strata: farms of up to 
160 acres and farms of more than 160 acres. Suppose that it costs 10 times as 
much to sample for corn acres ($y_i$) as for farm size ($x_i$), and let the cost be 

$$C = 100 = n + 0.1n' \tag{12.11}$$
Note that if $u = 0$ (complete matching) or if $u = n$ (no matching) this 
variance has the same value, $S^2/n$. 


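The optimum value of $u$ is found by minimizing (12.35) with respect to 
variation in $u$. This gives 

$$\frac{u}{n} = \frac{1}{1 + \sqrt{1 - \rho^2}}, \qquad \frac{m}{n} = \frac{\sqrt{1 - \rho^2}}{1 + \sqrt{1 - \rho^2}} \tag{12.36}$$

When the optimum $u$ is substituted in (12.35), the minimum variance 
works out as 

$$V_{opt}(\bar{y}_2') = \frac{S^2}{2n}\left[1 + \sqrt{1 - \rho^2}\right] \tag{12.37}$$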


Table 12.2 shows for a series of values of $\rho$ the optimum per cent that 
should be matched and the relative gain in precision compared with no 
matching. The best percentage to match never exceeds 50% and decreases 
steadily as $\rho$ increases. When $\rho = 1$, the formula suggests $m = 0$, which 
lies outside the range of our assumptions, since $m$ has been assumed 
reasonably large. The correct procedure in this case is to take $m = 2$. 
The two matched units are sufficient to determine the regression line 
exactly. 


TABLE 12.2 
OPTIMUM % MATCHED 

Optimum % matched      % gain in precision 


The greatest attainable gain in precision is 100% when $\rho = 1$. Unless 
$\rho$ is high, the gains are modest. 

Although the optimum percentage to match varies with $\rho$, only a single 
percentage can be used in practice for all items in a survey. The right-hand 
columns of Table 12.2 show the per cent gains in precision when one third 
and one fourth of the units are matched. Both are good compromises, 
except for items in which $\rho$ exceeds 0.95. 
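
The quantities discussed around Table 12.2 follow directly from (12.35) and (12.36), so they can be recomputed. The sketch below is an illustration only: the function names and the list of $\rho$ values are assumptions, and the printed figures are computed from the formulas rather than copied from the table.

```python
import math


def optimum_matched_fraction(rho):
    """Optimum m/n from (12.36)."""
    r = math.sqrt(1 - rho**2)
    return r / (1 + r)


def gain_in_precision(rho, matched_fraction):
    """Relative gain over no matching, using (12.35) with u = (1 - matched_fraction) n.

    Requires rho < 1 or matched_fraction > 0; otherwise (12.35) is 0/0.
    """
    u_frac = 1 - matched_fraction
    # (12.35) divided by S^2 / n, the variance with no matching.
    variance_ratio = (1 - u_frac * rho**2) / (1 - u_frac**2 * rho**2)
    return 1 / variance_ratio - 1


# Hypothetical choice of correlations for the illustration.
for rho in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    opt = optimum_matched_fraction(rho)
    print(
        f"rho = {rho:4.2f}"
        f"  optimum % matched = {100 * opt:5.1f}"
        f"  % gain at optimum = {100 * gain_in_precision(rho, opt):5.1f}"
        f"  % gain at 1/3 = {100 * gain_in_precision(rho, 1 / 3):5.1f}"
        f"  % gain at 1/4 = {100 * gain_in_precision(rho, 1 / 4):5.1f}"
    )
```

At the optimum the gain reduces to $(1 - \sqrt{1 - \rho^2})/(1 + \sqrt{1 - \rho^2})$, which tends to 100% as $\rho$ approaches 1.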
12.11 SAMPLING ON MORE THAN TWO OCCASIONS 


The general problem of replacement has been studied by Yates (1960) 
and Patterson (1950), with respect to both current estimates and estimates 
of change. When there are more than two occasions, the opportunities 
for a flexible use of the data are increased. On occasion $h$ we may have 
parts of the sample that are matched with occasion $h - 1$, parts that are 
matched with both occasions $h - 1$ and $h - 2$, and so on. In attempting 
to improve the current estimate, we might try a multiple regression 
involving all matchings to previous occasions. It is also possible to revise 
the current estimate for occasion $h - 1$ after the data for occasion $h$ are 
known. In the revised estimate the regression of occasion $h - 1$ on both 
occasion $h - 2$ and occasion $h$ could be utilized, assuming that suitably 
matched portions of the sample were available. 


TABLE 12.3 
ESTIMATES OF $\bar{Y}_h$ ON THE $h$th OCCASION 

              Estimate                                                            Variance 

Unmatched:    $\bar{y}_{hu}' = \bar{y}_{hu}$                                                 $S^2/u_h = 1/W_{hu}$ 

Matched:      $\bar{y}_{hm}' = \bar{y}_{hm} + b(\bar{y}_{h-1}' - \bar{y}_{h-1,m})$           $S^2(1 - \rho^2)/m_h + \rho^2V(\bar{y}_{h-1}') = 1/W_{hm}$ 


The present section contains an introduction to the subject. Attention 
will be restricted to current estimates in which only the regression on the 
sample immediately preceding is used. This results in some loss of 
precision, but since the correlation $\rho$ usually decreases as the time interval 
between the occasions is increased the loss of precision will seldom be 
great. The variance $S^2$ and the correlation coefficient $\rho$ between the item 
values on the same unit on two successive occasions are assumed constant 
throughout. 

On the $h$th occasion let $m_h$ and $u_h$ be the numbers of units that are 
matched and unmatched, respectively, with the $(h - 1)$th occasion. The two 
estimates of $\bar{Y}_h$ that can be made are given in Table 12.3. The only change 
in procedure from the second occasion (Table 12.1) is that in the regression 
adjustment of the estimate from the matched portion we use the improved 
estimate $\bar{y}_{h-1}'$ instead of the sample mean $\bar{y}_{h-1}$. 

The variance of the matched estimate $\bar{y}_{hm}'$ in Table 12.3 is derived from 
(12.24) at the end of section 12.5. Note that (a) our $m$ corresponds to the 
This means that if double sampling is not used ($n' = 0$) we can afford to take a 
sample of 100 farms to estimate corn acres. 
The relevant data for the population are 


Strata        $W_h$     $S_h^2$    $S_h$    $\bar{Y}_h$ 
1             0.786     312        17.7     19.404 
2             0.214     922        30.4     51.626 
Population              620                 26.297 


By (12.10) we could proceed at once to compute $V_{opt}$. However, the intermediate 
steps are given. We find 

$$V_n = \left(\sum W_hS_h\right)^2 = 417$$

$$V_{n'} = \sum W_h(\bar{Y}_h - \bar{Y})^2 = 175$$

so that by (12.9) 

$$\frac{n}{n'} = \frac{\sqrt{V_nc_{n'}}}{\sqrt{V_{n'}c_n}} = \sqrt{\frac{(417)(0.1)}{175}} = 0.488$$

From the cost equation (12.11) we obtain 

$$n' = \frac{100}{0.588} = 170, \qquad n = 170 \times 0.488 = 83$$

At this point the reader may verify from the data in this example that the 
neglected term in $W_h(1 - W_h)$ in the variance formula (12.2) is in fact 
negligible. From (12.8) we then have 

$$V_{opt} = \frac{417}{83} + \frac{175}{170} = 5.02 + 1.03 = 6.05$$

For a random sample of size 100, with no double sampling, we would have 

$$V = \frac{620}{100} = 6.20$$

Evidently there would be only a trifling gain from double sampling. 
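
The arithmetic of this example can be retraced in a few lines. The sketch below assumes the stratum data of the example (weights 0.786 and 0.214, standard deviations 17.7 and 30.4, means 19.404 and 51.626), unit costs of 1 for the second sample and 0.1 for the first, and a budget of 100; the variable names are illustrative.

```python
import math

# Data of the example: stratum weights, standard deviations, stratum means.
W = [0.786, 0.214]
S = [17.7, 30.4]
Ybar = [19.404, 51.626]
c_n, c_n1 = 1.0, 0.1      # unit costs of the second (n) and first (n') samples
C = 100.0                 # total budget
S2_population = 620.0     # population variance of corn acres

Y = sum(w * y for w, y in zip(W, Ybar))                 # population mean
V_n = sum(w * s for w, s in zip(W, S)) ** 2             # (sum W_h S_h)^2
V_n1 = sum(w * (y - Y) ** 2 for w, y in zip(W, Ybar))   # sum W_h (Ybar_h - Ybar)^2

# (12.9): n / n' = sqrt(V_n c_n') / sqrt(V_n' c_n), then solve the cost equation.
ratio = math.sqrt(V_n * c_n1 / (V_n1 * c_n))
n1 = C / (ratio * c_n + c_n1)
n = ratio * n1

V_opt = V_n / n + V_n1 / n1              # (12.8') at the optimum
V_single = S2_population / (C / c_n)     # single sample of size C / c_n

print(f"V_n = {V_n:.1f}, V_n' = {V_n1:.1f}, n/n' = {ratio:.3f}")
print(f"n' = {n1:.0f}, n = {n:.0f}")
print(f"V_opt = {V_opt:.2f}, single-sample V = {V_single:.2f}")
```

The same $V_{opt}$ is obtained directly from (12.10) as $(\sqrt{V_nc_n} + \sqrt{V_{n'}c_{n'}})^2/C$.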
Note. Reverting to corollary 1 of theorem 12.2, an alternative approach 
to the choice of the $n_h$ is to write $n_h = n\lambda_hw_h/\sum\lambda_hw_h$ and to choose the $\lambda_h$ to 
minimize $V(\bar{y}_{st})$ as given in corollary 1. It will be found that the optimum 
$\lambda_h \propto S_h$, assuming $1/n'$ negligible. For practical purposes this is the same 
solution as that already given. 


12.4 ESTIMATED VARIANCE IN DOUBLE SAMPLING 
FOR STRATIFICATION 


An unbiased estimate of $V(\bar{y}_{st})$ in (12.2) can be constructed without 
difficulty. We assume $n_h/N_h$ and $1/N$ negligible. 

Theorem 12.3. An unbiased estimate of $V(\bar{y}_{st})$ is 

$$v(\bar{y}_{st}) = \frac{n'}{n' - 1}\sum\left\{\left(w_h^2 - \frac{g'w_h}{n'}\right)\frac{s_h^2}{n_h} + \frac{g'w_h(\bar{y}_h - \bar{y}_{st})^2}{n'}\right\} \tag{12.12}$$

where $g' = (N - n')/(N - 1)$. 

Proof. By averaging first over samples with fixed $w_h$ and then over 
the selections of the $w_h$, the expectations of the terms inside the braces 
work out as follows: 

$$E\sum\frac{w_h^2s_h^2}{n_h} = \sum\left[W_h^2 + \frac{g'W_h(1 - W_h)}{n'}\right]\frac{S_h^2}{n_h} \tag{12.13}$$

$$E\sum\left(-\frac{g'w_hs_h^2}{n'n_h}\right) = -\frac{g'}{n'}\sum\frac{W_hS_h^2}{n_h} \tag{12.14}$$

$$E\sum\frac{g'w_h(\bar{y}_h - \bar{y}_{st})^2}{n'} = \frac{g'}{n'}E\left(\sum w_h\bar{y}_h^2 - \bar{y}_{st}^2\right) = \frac{g'}{n'}\left[\sum W_h(\bar{Y}_h - \bar{Y})^2 + \sum\frac{W_hS_h^2}{n_h} - V(\bar{y}_{st})\right] \tag{12.15}$$

Adding these three equations, we obtain, on comparing with (12.2), 

$$\frac{(n' - 1)}{n'}E[v(\bar{y}_{st})] = V(\bar{y}_{st})\left(1 - \frac{g'}{n'}\right) = V(\bar{y}_{st})\frac{N(n' - 1)}{(N - 1)n'} \doteq V(\bar{y}_{st})\frac{(n' - 1)}{n'}$$

assuming $1/N$ negligible. This completes the proof. 

If $n'$ is large relative to the $n_h$, $v(\bar{y}_{st})$ reduces to 

$$v(\bar{y}_{st}) = \sum\frac{w_h^2s_h^2}{n_h} \tag{12.16}$$

This expression is equivalent to assuming that errors in the strata 
weights $w_h$ can be ignored. 
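
The estimator of theorem 12.3 is simple to compute once the stratum summaries are available. The sketch below is one possible implementation, written under the assumption that the estimator has the form $\frac{n'}{n'-1}\sum\{(w_h^2 - g'w_h/n')s_h^2/n_h + g'w_h(\bar{y}_h - \bar{y}_{st})^2/n'\}$; the function name and the input figures are hypothetical and are not data from the text.

```python
def v_double_stratified(w, s2, ybar, n_h, n_first, g=1.0):
    """Estimated variance of the double-sampling stratified mean.

    w       : first-sample stratum proportions w_h
    s2      : second-sample stratum variances s_h^2
    ybar    : second-sample stratum means
    n_h     : second-sample stratum sizes
    n_first : size n' of the first sample
    g       : g' = (N - n') / (N - 1); 1 when n'/N is negligible
    """
    y_st = sum(wh * yh for wh, yh in zip(w, ybar))
    total = 0.0
    for wh, s2h, yh, nh in zip(w, s2, ybar, n_h):
        total += (wh**2 - g * wh / n_first) * s2h / nh   # within-stratum part
        total += g * wh * (yh - y_st) ** 2 / n_first     # between-stratum part
    return n_first / (n_first - 1) * total


# Hypothetical summaries for two strata.
w = [0.6, 0.4]
s2 = [4.0, 9.0]
ybar = [10.0, 20.0]
n_h = [30, 20]

print("v with n' = 500:       ", v_double_stratified(w, s2, ybar, n_h, 500))
print("v with n' very large:  ", v_double_stratified(w, s2, ybar, n_h, 10**12))
print("sum w_h^2 s_h^2 / n_h: ", sum(wh**2 * s / n for wh, s, n in zip(w, s2, n_h)))
```

As $n'$ grows, the estimate approaches $\sum w_h^2s_h^2/n_h$, the value obtained when errors in the weights are ignored.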
Corollary. If $p_h$ is the observed proportion of units in stratum $h$ which 
fall into some defined class and $p_{st} = \sum w_hp_h$ is the estimate of the 
population proportion, then an estimate of $V(p_{st})$ is 

$$v(p_{st}) = \frac{n'}{n' - 1}\sum\left[\left(w_h^2 - \frac{g'w_h}{n'}\right)\frac{p_hq_h}{n_h - 1} + \frac{g'w_h(p_h - p_{st})^2}{n'}\right]$$

In almost all cases this can be simplified to 

$$v(p_{st}) = \sum\left[\frac{w_h^2p_hq_h}{n_h - 1} + \frac{w_h(p_h - p_{st})^2}{n'}\right]$$

Frequently the term in $1/n'$ can also be dropped. 

Example. In a simple random sample of 374 households 292 were occupied 
by white families and 82 by nonwhite families. A subsample of about one in 
four households gave the following data as to ownership: 


              Owned    Rented    Total 

White:          31       43        74 
Nonwhite:        4       14        18 


Estimate the proportion of rented households in the area from which the sample 
was drawn and find the standard error of the estimate. 
If the first stratum consists of the white-occupied households, 

$w_1 = 292/374 = 0.78$, $w_2 = 82/374 = 0.22$ 

$p_1 = 0.60$, $p_2 = 14/18 = 0.78$ 

$p_{st} = w_1p_1 + w_2p_2 = 0.64$ 

$n' = 374$, $n_1 = 74$, $n_2 = 18$ 


It is readily found that only the leading term in $v(p_{st})$ is of importance. Hence 

$$v(p_{st}) = \sum\frac{w_h^2p_hq_h}{n_h - 1} = \frac{(0.78)^2(0.60)(0.40)}{73} + \frac{(0.22)^2(0.78)(0.22)}{17} = 0.00248$$

$$s(p_{st}) = 0.049$$

The estimated proportion of rented households is 0.64 ± 0.049. The reader may 
verify that there is only a trifling gain in precision over a single-stage simple 
random sample of size 92. In view of the relatively small size of the nonwhite 
stratum, a greater difference between the proportions of rented households for 
whites and nonwhites would be necessary to make double sampling profitable. 


12.5 REGRESSION ESTIMATES 


In a number of the applications of double sampling the auxiliary variate 
$x_i$ has been used to make a regression estimate of $\bar{Y}$. We shall assume 
that the population is infinite and that the relation between $y_i$ and $x_i$ is 
linear. Write as a model 

$$y_{i\alpha} = \bar{Y} + B(x_i - \bar{X}) + e_{i\alpha} \tag{12.17}$$

where the second subscript $\alpha$ is introduced as a reminder that for fixed 
$x_i$ the random variate $e_{i\alpha}$ follows a frequency distribution with mean 0 and 
variance $S_e^2 = S_y^2(1 - \rho^2)$. 

In the first (large) sample, of size $n'$, we measure only $x_i$; in the second, 
of size $n$, we measure both $x_i$ and $y_{i\alpha}$. The estimate of $\bar{Y}$ is 

$$\bar{y}_{lr} = \bar{y} + b(\bar{x}' - \bar{x})$$

where $\bar{x}'$, $\bar{x}$ are the means of $x_i$ in the first and second samples, respectively, 
and $b$ is the least squares regression coefficient of $y_{i\alpha}$ on $x_i$, computed from 
the second sample. 

We now examine the error of estimate $(\bar{y}_{lr} - \bar{Y})$. From (12.17) we find 

$$\bar{y} = \bar{Y} + B(\bar{x} - \bar{X}) + \bar{e} \tag{12.18}$$

$$b = \frac{\sum_{i=1}^{n} y_{i\alpha}(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = B + \frac{\sum_{i=1}^{n} e_{i\alpha}(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{12.19}$$

From (12.18) and (12.19), substitute for $\bar{y}$ and $b$ in the error of estimate. 
This gives 

$$\bar{y}_{lr} - \bar{Y} = (\bar{y} - \bar{Y}) + b(\bar{x}' - \bar{x})$$

$$= B(\bar{x} - \bar{X}) + \bar{e} + B(\bar{x}' - \bar{x}) + (\bar{x}' - \bar{x})\frac{\sum e_{i\alpha}(x_i - \bar{x})}{\sum(x_i - \bar{x})^2}$$

$$= \bar{e} + (\bar{x}' - \bar{x})\frac{\sum e_{i\alpha}(x_i - \bar{x})}{\sum(x_i - \bar{x})^2} + B(\bar{x}' - \bar{X}) \tag{12.20}$$

In ordinary regression theory, in which $\bar{x}' = \bar{X}$, the standard practice 
is to discuss the conditional frequency distribution of the error of estimate 
$(\bar{y}_{lr} - \bar{Y})$ in repeated samples in which the $x_i$ values are fixed. If this 
approach is adopted in the present problem, keeping the $x_i$ values fixed 
in both the first and the second samples, we see that the estimate is biased 
in the conditional distribution, since 

$$E(\bar{y}_{lr} - \bar{Y}) = B(\bar{x}' - \bar{X})$$

Hence the conditional mean square error (MSE) of $\bar{y}_{lr}$ is 

$$\text{MSE}(\bar{y}_{lr}) = S_y^2(1 - \rho^2)\left[\frac{1}{n} + \frac{(\bar{x}' - \bar{x})^2}{\sum(x_i - \bar{x})^2}\right] + B^2(\bar{x}' - \bar{X})^2 \tag{12.21}$$

This expression is not suitable for comparison with other methods of 
sampling, since the MSE depends on the set of $x_i$ which appears in the 
two samples. Instead, we need the average MSE over all possible drawings 
of the first and second samples. 

A simple result is obtained under the assumption that (a) the first 
sample is drawn at random, (b) the second sample is a random subsample 
drawn from the first, and (c) the $x_i$ are normally distributed. In this event 
the average MSE is found to be 

$$V(\bar{y}_{lr}) = S_y^2(1 - \rho^2)\left[\frac{1}{n} + \left(\frac{1}{n} - \frac{1}{n'}\right)\frac{1}{(n - 3)}\right] + \frac{B^2S_x^2}{n'} \tag{12.22}$$

$$= \frac{S_y^2(1 - \rho^2)}{n}\left[1 + \frac{n' - n}{n'}\,\frac{1}{(n - 3)}\right] + \frac{\rho^2S_y^2}{n'} \tag{12.23}$$

since $B^2S_x^2 = \rho^2S_y^2$. 

If the $x_i$ are not normally distributed, the only term whose value is 
changed is that in $1/(n - 3)$. In regard to assumption (b), the small 
sample might not be drawn at random from the large sample: it is preferable 
to select the small sample to obtain a wide spread in the values of 
$x_i$ and reduce the sampling error of $b$. The effect is to reduce, perhaps 
considerably, the term in $1/(n - 3)$. 

In some applications the second sample is drawn independently of the 
first. In this event the argument given in this section remains unchanged 
down to (12.21). In (12.22) the term 

$$\left(\frac{1}{n} - \frac{1}{n'}\right)$$

is replaced by 

$$\left(\frac{1}{n} + \frac{1}{n'}\right)$$

This case of two independent samples was first considered by Chameli 
Bose (1943). 

To summarize, there is some doubt about the exact value of the term 
in $1/(n - 3)$ in the average variance. However, if $1/n$ is negligible, this 
term is also negligible. This gives the following theorem. 


Theorem 12.4. If the first sample is size $n'$, the second is size $n$, and 
$1/n$ is negligible, the variance of $\bar{y}_{lr}$, the regression estimate in double 
sampling, is given approximately by 

$$V(\bar{y}_{lr}) = \frac{S_y^2(1 - \rho^2)}{n} + \frac{\rho^2S_y^2}{n'} \tag{12.24}$$

12.6 DOUBLE SAMPLING WITH REGRESSION VERSUS 
SINGLE SAMPLING 


From the variance formula (12.24), double sampling with a regression
estimate can be compared with a single simple random sample under the
assumptions that (a) the first sample is a simple random sample, (b) $1/n$
is negligible, and (c) the second sample is also a simple random sample.
Results for this case should provide a rough guide to other cases.
Write

$$V(\bar{y}_{lr}) = \frac{V_n}{n} + \frac{V_{n'}}{n'}$$

where

$$V_n = S_y^2(1-\rho^2), \qquad V_{n'} = \rho^2 S_y^2$$

$$\text{cost} = C = n c_n + n' c_{n'}$$

The problem of finding the optimum $n$ and $n'$ and the minimum variance
is exactly the same as in double sampling for stratification (section 12.3).
Equation 12.10 gives

$$V_{opt} = \frac{\left(\sqrt{V_n c_n} + \sqrt{V_{n'} c_{n'}}\right)^2}{C} = \frac{S_y^2}{C}\left[\sqrt{(1-\rho^2)c_n} + \rho\sqrt{c_{n'}}\right]^2 \qquad (12.25)$$

where $\rho$ is taken as positive.


If all resources are devoted to a single sample, with no adjustment for
regression, this sample has size $n_0 = C/c_n$ and the variance of its mean is

$$V(\bar{y}) = \frac{S_y^2}{n_0} = \frac{S_y^2 c_n}{C} \qquad (12.26)$$

Hence, double sampling gives a smaller variance if

$$c_n > \left[\sqrt{(1-\rho^2)c_n} + \rho\sqrt{c_{n'}}\right]^2$$

This inequality may be expressed in two ways:

$$\frac{c_n}{c_{n'}} > \frac{\rho^2}{\left(1-\sqrt{1-\rho^2}\right)^2} = \frac{\left(1+\sqrt{1-\rho^2}\right)^2}{\rho^2} \qquad (12.27)$$

or

$$\rho^2 > \frac{4 c_n c_{n'}}{(c_n + c_{n'})^2} \qquad (12.28)$$

Equation 12.27 shows that for a given value of $\rho$ the ratio of the cost
per unit in the second sample to the cost per unit in the first sample must
exceed a critical value before double sampling brings an increase in
precision. Given $c_n$ and $c_{n'}$, (12.28) shows the critical value that
must be exceeded by $\rho^2$ to make double sampling profitable.
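The comparison in (12.25) to (12.28) is easy to tabulate. The following sketch, with hypothetical function names and illustrative costs, computes the two variances for a fixed budget and the critical values in (12.27) and (12.28).

```python
import math


def v_opt_double(s2_y, rho, c_n, c_n1, budget):
    """Minimum variance with double sampling, equation (12.25)."""
    bracket = math.sqrt((1 - rho**2) * c_n) + rho * math.sqrt(c_n1)
    return s2_y * bracket**2 / budget


def v_single(s2_y, c_n, budget):
    """Variance of the mean of a single sample of size C / c_n, equation (12.26)."""
    return s2_y * c_n / budget


def critical_cost_ratio(rho):
    """Value that c_n / c_n' must exceed, equation (12.27)."""
    return (1 + math.sqrt(1 - rho**2))**2 / rho**2


def critical_rho_squared(c_n, c_n1):
    """Value that rho^2 must exceed, equation (12.28)."""
    return 4 * c_n * c_n1 / (c_n + c_n1)**2


# Illustrative values only: unit costs 10 and 1, budget 1000, S_y^2 = 1.
double = v_opt_double(1.0, 0.8, 10.0, 1.0, 1000.0)
single = v_single(1.0, 10.0, 1000.0)
print(double, single, critical_cost_ratio(0.8), critical_rho_squared(10.0, 1.0))
```

For $\rho = 0.8$ the critical cost ratio works out to 4, so a ratio of 10 to 1 in unit costs favours double sampling, while at a ratio of exactly 4 the two variances coincide.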
n in (12.24) and (b) the term $\rho^2 S_y^2/n'$, which comes from the term
$B(\bar{x}' - \bar{X})$ in (12.21), is replaced by $\rho^2 V(\bar{y}_{h-1}')$, since $B = \rho$ and
$\bar{y}_{h-1}'$ corresponds to $\bar{x}'$ in the earlier analysis.

We now examine the precision obtained if the optimum $m_h$ and $u_h$ and
the optimum weights are used on every occasion. It will be found that
the optimum $m_h/n$ increases steadily on successive occasions, rapidly
approaching a limiting value of 1/2.

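Weighting inversely as the variance, the best estimate of $\bar{Y}_h$ is

$$\bar{y}_h' = \phi_h \bar{y}_{hu}' + (1-\phi_h)\bar{y}_{hm}' \qquad (12.38)$$

where $\phi_h = W_{hu}/(W_{hu} + W_{hm})$. This gives

$$V(\bar{y}_h') = \frac{1}{W_{hu} + W_{hm}} = \frac{g_h S^2}{n}$$

where $g_h$ denotes the ratio of the variance on occasion $h$ to that on the
first occasion. Substituting for $W_{hu}$, $W_{hm}$ from Table 12.3, we have

$$\frac{1}{g_h} = \frac{S^2}{n}(W_{hu} + W_{hm}) = \frac{u_h}{n} + \frac{1}{n\left[\dfrac{1-\rho^2}{m_h} + \dfrac{\rho^2 g_{h-1}}{n}\right]} \qquad (12.39)$$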
We now choose $m_h$ and $u_h$ to maximize this quantity and therefore to
minimize $V(\bar{y}_h')$. Writing $u_h = n - m_h$ and differentiating the right side
of (12.39) with respect to $m_h$, we obtain

$$\frac{1-\rho^2}{m_h^2} = \left[\frac{1-\rho^2}{m_h} + \frac{\rho^2 g_{h-1}}{n}\right]^2$$

This gives, on solving for the optimum $m_h$, say $\hat{m}_h$,

$$\frac{\hat{m}_h}{n} = \frac{\sqrt{1-\rho^2}}{g_{h-1}\left(1+\sqrt{1-\rho^2}\right)} \qquad (12.40)$$

When this value is substituted in (12.39), the relation becomes, after
some algebraic manipulation,

$$\frac{1}{g_h} = 1 + \frac{1-\sqrt{1-\rho^2}}{1+\sqrt{1-\rho^2}}\cdot\frac{1}{g_{h-1}} \qquad (12.41)$$

This relation may be written

$$r_h = 1 + b\,r_{h-1}$$

where $r_h = 1/g_h$ and $r_1 = 1/g_1 = 1$. Repeated use of this recurrence
relation gives

$$r_h = 1 + b + b^2 + \cdots + b^{h-1} = \frac{1-b^h}{1-b}$$

where, from (12.41), $b = (1-\sqrt{1-\rho^2})/(1+\sqrt{1-\rho^2})$. Since $0 < b < 1$,
the limiting variance factor $g_\infty$ is

$$g_\infty = 1 - b = \frac{2\sqrt{1-\rho^2}}{1+\sqrt{1-\rho^2}} \qquad (12.42)$$

Hence the variance of $\bar{y}_h'$ tends to

$$V(\bar{y}_\infty') = \frac{S^2}{n}\left(\frac{2\sqrt{1-\rho^2}}{1+\sqrt{1-\rho^2}}\right) \qquad (12.43)$$

Finally, the limiting value of $\hat{m}_h$ is obtained from (12.40) as

$$\frac{\hat{m}_\infty}{n} = \frac{\sqrt{1-\rho^2}}{g_\infty\left(1+\sqrt{1-\rho^2}\right)} = \frac{1}{2}$$

irrespective of the value of $\rho$.
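The recurrence for the variance factor and the optimum matched fraction can be iterated directly. This is a minimal sketch, assuming the relations (12.40) and (12.41) in the forms $\hat{m}_h/n = \sqrt{1-\rho^2}/[g_{h-1}(1+\sqrt{1-\rho^2})]$ and $1/g_h = 1 + b/g_{h-1}$ with $g_1 = 1$; the function names are hypothetical.

```python
import math


def optimum_matching(rho, occasions):
    """Return (h, optimum m_h/n, g_h) for h = 2..occasions, starting from g_1 = 1."""
    root = math.sqrt(1 - rho**2)
    b = (1 - root) / (1 + root)
    g_prev = 1.0
    rows = []
    for h in range(2, occasions + 1):
        matched = root / (g_prev * (1 + root))   # optimum fraction, (12.40)
        g_h = 1 / (1 + b / g_prev)               # variance factor, (12.41)
        rows.append((h, matched, g_h))
        g_prev = g_h
    return rows


def limiting_g(rho):
    """Limiting variance factor, equation (12.42)."""
    root = math.sqrt(1 - rho**2)
    return 2 * root / (1 + root)


for rho in (0.7, 0.8, 0.9, 0.95):
    for h, matched, g_h in optimum_matching(rho, 5):
        print(f"rho={rho}  h={h}  % matched={100 * matched:.0f}  g_h={g_h:.4f}")
```

For $\rho = 0.8$ the second occasion gives 37.5% matched and $g_2 = 0.8$, and the matched fraction then climbs toward one half as $g_h$ falls toward its limit of 0.75.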

Table 12.4 shows the optimum percentage to match, $100\hat{m}_h/n$, as found
from (12.40), and the resulting variances for $\rho$ = 0.7, 0.8, 0.9, and 0.95
and for a series of values of $h$.

TABLE 12.4
OPTIMUM % MATCHED AND VARIANCES

Columns: $h$; % matched, $100\hat{m}_h/n$, for $\rho$ = 0.7, 0.8, 0.9, 0.95;
$nV(\bar{y}_h')/S^2$ for $\rho$ = 0.7, 0.8, 0.9, 0.95.

By the fourth occasion, the optimum per cent matched is close to 50
for all the values of $\rho$ shown, although a smaller amount of matching is
indicated for the second and third occasions. The reductions in variance,
that is, $(1 - g_h)$, are modest if $\rho$ is less than 0.8.

12.12 SIMPLIFICATIONS AND FURTHER DEVELOPMENTS 


In practical application the preceding analysis may need modification.
We assumed that all replacement policies cost the same and are equally
feasible. With human populations, field costs are likely to be lower if the
same units are retained for a number of occasions. If estimates of the
change in the population total or mean are of interest, this factor also
points toward matching more than half the units from one occasion to
the next.

It is convenient also to keep the weights and the proportion matched
constant, rather than to change them on every occasion. Consequently,
we shall investigate the variances of $\bar{y}_h'$ and of the estimated change
$(\bar{y}_h' - \bar{y}_{h-1}')$ when $m$, $u$, and $\phi$ are held constant. We continue to write
$V(\bar{y}_h') = g_h S^2/n$, although the actual value of $g_h$ will be different
from that in the preceding section.


The estimate is now

$$\bar{y}_h' = \phi\,\bar{y}_{hu}' + (1-\phi)\,\bar{y}_{hm}'$$

Substituting the expressions for the two variances (from Table 12.3), we
have

$$V(\bar{y}_h') = \frac{g_h S^2}{n} = \phi^2 V(\bar{y}_{hu}') + (1-\phi)^2 V(\bar{y}_{hm}')$$

$$= S^2\left[\frac{\phi^2}{u} + \frac{(1-\phi)^2(1-\rho^2)}{m} + \frac{\rho^2(1-\phi)^2 g_{h-1}}{n}\right]$$

Hence

$$g_h = \left[\frac{\phi^2}{\mu} + \frac{(1-\phi)^2(1-\rho^2)}{\lambda}\right] + \rho^2(1-\phi)^2 g_{h-1} \qquad (12.44)$$

where $\mu = u/n$, $\lambda = m/n$. Write this relation as

$$g_h = a + b\,g_{h-1}$$

By repeated application, we have, since $g_1 = 1$,

$$g_h = \frac{a(1-b^{h-1})}{1-b} + b^{h-1}$$

Since $b = \rho^2(1-\phi)^2$ is less than 1, the limiting value is

$$g_\infty = \frac{a}{1-b} = \frac{\lambda\phi^2 + \mu(1-\phi)^2(1-\rho^2)}{\lambda\mu\left[1-\rho^2(1-\phi)^2\right]} \qquad (12.45)$$

The value of the weight $\phi$ which minimizes the limiting variance may
be found by differentiating (12.45). This leads to a quadratic equation
whose appropriate root is

$$\phi_{opt} = \frac{\sqrt{1-\rho^2}\left[\sqrt{1-\rho^2+4\lambda\mu\rho^2} - \sqrt{1-\rho^2}\right]}{2\lambda\rho^2}$$

In practice, the value of $\rho$ will not be known exactly and will differ from
item to item. A simple compromise value can usually be chosen. Clearly,
$\phi_{opt}$ will be less than $\mu = u/n$, since the matched part of the sample gives
higher precision per unit than the unmatched part. For example, with
$\mu$ = 0.25, that is, 1/4 of the sample unmatched, $\phi_{opt}$ turns out to be 0.216,
0.198, and 0.164 for $\rho$ = 0.7, 0.8, 0.9. The choice of $\phi$ = 0.2 would be
adequate for this range of $\rho$.
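The quoted optimum weights can be reproduced from the limiting variance factor. The sketch below assumes the limiting form $g_\infty = [\lambda\phi^2 + \mu(1-\phi)^2(1-\rho^2)]/\{\lambda\mu[1-\rho^2(1-\phi)^2]\}$ from (12.45) and takes the positive root of the quadratic $\lambda\rho^2\phi^2 + (1-\rho^2)\phi - \mu(1-\rho^2) = 0$ obtained by setting its derivative to zero; the function names are hypothetical.

```python
import math


def limiting_variance_factor(phi, rho, lam):
    """Limiting g for constant weight phi and matched fraction lam, as in (12.45)."""
    mu = 1 - lam
    numerator = lam * phi**2 + mu * (1 - phi)**2 * (1 - rho**2)
    denominator = lam * mu * (1 - rho**2 * (1 - phi)**2)
    return numerator / denominator


def optimum_weight(rho, lam):
    """Weight phi that minimizes the limiting variance factor."""
    mu = 1 - lam
    root = math.sqrt(1 - rho**2)
    disc = math.sqrt(1 - rho**2 + 4 * lam * mu * rho**2)
    return root * (disc - root) / (2 * lam * rho**2)


# One quarter of the sample unmatched (mu = 0.25, lam = 0.75).
for rho in (0.7, 0.8, 0.9):
    phi = optimum_weight(rho, 0.75)
    print(f"rho={rho}  phi_opt={phi:.3f}  g_limit={limiting_variance_factor(phi, rho, 0.75):.4f}")
```

The three weights agree with the values 0.216, 0.198, and 0.164 given in the text, and each is smaller than $\mu = 0.25$.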

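For the estimate of change, we have

$$V(\bar{y}_h' - \bar{y}_{h-1}') = V(\bar{y}_h') + V(\bar{y}_{h-1}') - 2\,\mathrm{Cov}(\bar{y}_h', \bar{y}_{h-1}') \qquad (12.46)$$

To find the covariance term, note that if $y_{hi}$, $y_{h-1,i}$ are the values for the
ith unit in the matched set on occasions $h$ and $(h-1)$, our model is

$$y_{hi} = \bar{Y}_h + \rho(y_{h-1,i} - \bar{Y}_{h-1}) + e_{hi}$$

where the $e_{hi}$ are independent of the y's. From this model it is found by
substitution that

$$\bar{y}_{hm}' = \bar{y}_{hm} + \rho(\bar{y}_{h-1}' - \bar{y}_{h-1,m}) = \bar{Y}_h + \rho(\bar{y}_{h-1}' - \bar{Y}_{h-1}) + \bar{e}_{hm}$$

Hence the covariance of $\bar{y}_{hm}'$ and $\bar{y}_{h-1}'$ is $\rho V(\bar{y}_{h-1}')$. But

$$\mathrm{Cov}(\bar{y}_h', \bar{y}_{h-1}') = \mathrm{Cov}\{[\phi\bar{y}_{hu}' + (1-\phi)\bar{y}_{hm}'],\ \bar{y}_{h-1}'\} = \rho(1-\phi)V(\bar{y}_{h-1}')$$

since $\bar{y}_{hu}'$ is independent of $\bar{y}_{h-1}'$. From (12.46), this gives

$$V(\bar{y}_h' - \bar{y}_{h-1}') = \frac{S^2}{n}\left\{g_h + g_{h-1}[1 - 2\rho(1-\phi)]\right\} \qquad (12.47)$$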


From (12.44) and (12.47), the variances of $\bar{y}_h'$ and $(\bar{y}_h' - \bar{y}_{h-1}')$ may be
computed for any values of $m$, $\phi$, and $\rho$. Table 12.5 shows these variances
for $\lambda = m/n$ = 1/2 and 3/4. The weight $\phi$ was taken as 0.35 for $\lambda$ = 1/2 and
0.2 for $\lambda$ = 3/4.

The results indicate that an increase in the proportion retained from
1/2 to 3/4 produces only small increases in the variance of the current
estimate and gives substantially larger reductions in the variance of the
estimate of change. For example, with $\rho$ = 0.8, the increase in $V(\bar{y}_h')$ is
about 5%, whereas the decrease in $V(\bar{y}_h' - \bar{y}_{h-1}')$ is more than 20%. This
suggests that retention of 2/3, 3/4, or 4/5 from one occasion to the next
may be a good practical policy if current estimates and estimates of
change are both wanted.
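The trade-off between the current estimate and the estimate of change can be checked at the limiting values of the recursion. This sketch assumes the constant-weight recursion $g_h = a + b\,g_{h-1}$ with $a = \phi^2/\mu + (1-\phi)^2(1-\rho^2)/\lambda$ and $b = \rho^2(1-\phi)^2$, and the change variance factor $g_h + g_{h-1}[1 - 2\rho(1-\phi)]$; working at the limit rather than at a particular occasion is a simplification, and the function names are hypothetical.

```python
def limiting_g(phi, rho, lam):
    """Limit of g_h = a + b * g_{h-1} for constant phi and lam = m/n."""
    mu = 1 - lam
    a = phi**2 / mu + (1 - phi)**2 * (1 - rho**2) / lam
    b = rho**2 * (1 - phi)**2
    return a / (1 - b)


def limiting_change_factor(phi, rho, lam):
    """n V(change) / S^2 when g_h and g_{h-1} have both reached the limit."""
    g = limiting_g(phi, rho, lam)
    return g * (2 - 2 * rho * (1 - phi))


rho = 0.8
g_half = limiting_g(0.35, rho, 0.5)        # half the units retained
g_three_q = limiting_g(0.2, rho, 0.75)     # three quarters retained
increase_current = g_three_q / g_half - 1
decrease_change = 1 - (limiting_change_factor(0.2, rho, 0.75)
                       / limiting_change_factor(0.35, rho, 0.5))
print(f"current estimate: +{increase_current:.1%}   change: -{decrease_change:.1%}")
```

At the limit the current estimate loses about 5% and the estimate of change gains a little over 20%, in line with the figures quoted for $\rho$ = 0.8.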

Comparison of $V(\bar{y}_h')$ for $\lambda$ = 1/2 in Table 12.5 with the optimum
variances in Table 12.4 shows that little precision is lost by using a
constant weight and a fixed $\lambda$ = 1/2.

If $\rho$ exceeds 0.8, the regression coefficient $b = \rho$ may be replaced by 1
with only a small additional loss of precision. This gives an estimate
$\bar{y}_h''$ of the form

$$\bar{y}_h'' = \phi\,\bar{y}_{hu} + (1-\phi)(\bar{y}_{h-1}'' + \bar{y}_{hm} - \bar{y}_{h-1,m}) \qquad (12.48)$$
In the important Current Population Survey taken monthly by the
U.S. Bureau of the Census, one quarter of the second-stage units are 
replaced each month, so that an individual household remains in the 
sample during four consecutive months. The household is omitted for 
the eight succeeding months but is then brought back for another four 
months, thus increasing slightly the precision of year-to-year comparisons. 

The composite estimate used in this survey is of a form related to (12.48)
but slightly different.

$$\bar{y}_h'' = (1-K)\bar{y}_h + K(\bar{y}_{h-1}'' + \bar{y}_{hm} - \bar{y}_{h-1,m}) \qquad (12.49)$$

where $K$ is a constant weighting factor. The difference is that $\bar{y}_h$, the
current estimate for the whole sample, takes the place of the $\bar{y}_{hu}$ in (12.48).
The quantities $\bar{y}_{hm}$, $\bar{y}_{h-1,m}$, $\bar{y}_h$ in (12.49) are ratio estimates of a fairly
complex type. The variance of $\bar{y}_h''$ (due to Bershad) is given in Hansen,
Hurwitz, and Madow (1953). Since the primary units remain unchanged,
only the within-units component of $V(\bar{y}_h'')$ is affected by this replacement
policy.
In another rotation policy a new sample is drawn on each occasion,
with no matching. With monthly sampling, this plan is appropriate when
annual estimates, and to a lesser extent semiannual or quarterly estimates,
are of primary importance, for example, in an illness survey with emphasis
on chronic diseases. If the questionnaire obtains for any unit the results
for the preceding month as well as for the current month, we can consider
composite estimates of the form

$$\bar{y}_h' = \bar{y}_h + \phi_h(\bar{y}_{h-1}' - \bar{y}_{h-1,h}) \qquad (12.50)$$

where
$\bar{y}_h$ = estimate made from current data in the current sample,
$\bar{y}_{h-1,h}$ = estimate made from previous month's data in the current
sample,
$\bar{y}_{h-1}'$ = composite estimate for the previous month.

The theory is discussed by Hansen, Hurwitz, and Madow (1953) and
Woodruff (1959), who apply it to a survey of retail sales, and by Eckler
(1955). In the Retail Trade Survey the composite estimate involves a
ratio estimate, being of the form

$$\bar{y}_h'' = (1-W)\bar{y}_h + W\left(\frac{\bar{y}_h}{\bar{y}_{h-1,h}}\right)\bar{y}_{h-1}''$$

where $W$ is a weighting factor. Since month-to-month correlations are
very high, averaging around 0.98, the gains in precision are substantial.
One month later a revised composite estimate for month $h$ is computed,
using the results for month $h$ from the new sample taken in month $(h + 1)$.

With this method, it is essential that the data obtained for the preceding
month from the current sample are accurate. This may not be so if
the data depend on the unrecorded memory of the respondent, but
the method may work successfully if the data are of a type that the
respondent records carefully as a routine matter.


EXERCISES 


12.1 $3000 is allocated for a survey to estimate a proportion. The main
survey will cost $10 per sampling unit. Information is available in files, at a cost
of $0.25 per sampling unit, that enables the units to be classified into two strata
of about equal sizes. If the true proportion is 0.2 in stratum 1 and 0.8 in stratum
2, estimate the optimum $n$, $n'$, and the resulting value of $V(p_{st})$. Does double
sampling produce a gain in precision over single sampling? (The ratios $n'/N$,
$n/N$ may be ignored.)

12.2 For the $W_h$, $P_h$ in exercise 12.1, find the cost ratios $c_n/c_{n'}$ for which
double sampling is more economical than single sampling.

12.3 A population contains $L$ strata of equal size. If $V_{ran}$ denotes the variance
of the mean of a simple random sample and $V_{st}$, $V_{dst}$ are the corresponding
variances for stratified random sampling with proportional allocation and
double sampling with stratification, show that, approximately,

$$nV_{ran} = S_w^2 + \frac{\sum(\bar{Y}_h - \bar{Y})^2}{L}$$

$$nV_{st} = S_w^2$$

$$nV_{dst} = S_w^2 + \frac{n}{n'}\cdot\frac{\sum(\bar{Y}_h - \bar{Y})^2}{L}$$

where $S_w^2$ is the average variance within strata. ($N$ and $n'$ may both be assumed
large relative to $L$, and the $n_h$ in double sampling may be assumed equal to $n/L$.)
Hence, if $(RP)_{st}$ denotes the relative precision of the stratified sample to the
simple random sample, with a corresponding definition for $(RP)_{dst}$, show that

$$(RP)_{dst} = \frac{(RP)_{st}}{1 + (n/n')[(RP)_{st} - 1]}$$

For $(RP)_{st}$ = 2, plot $(RP)_{dst}$ against $n/n'$. How small must this ratio be in order
that $(RP)_{dst}$ [...]

12.4 If $\rho$ = 0.8 in double sampling for regression, how large must $n'$ be relative
to $n$, if the loss in precision due to sampling errors in the mean of the large
sample is to be less than 10%?

12.5 In an application of double sampling for regression, the small sample
was of size 87 and the large sample of size 300. The following computations
apply to the small sample:

$$\sum(y_i - \bar{y})^2 = 17283, \qquad \sum(y_i - \bar{y})(x_i - \bar{x}) = 5114, \qquad \sum(x_i - \bar{x})^2 = 3248$$

Compute the standard error of the regression estimate of $\bar{Y}$.

12.6 For $\rho$ = 0.95, verify the data given in Table 12.4 for the optimum
percentage which should be matched and for the gain in precision relative to no
matching. Compute the corresponding per cent gains in precision if one third
of the units are retained from the first to the second occasion and one half of the
units are retained on each subsequent occasion.

12.7 In simple random sampling on two occasions, suppose that the estimate
on the second occasion is, in the notation of section 12.10,

$$\bar{y}_2'' = (1-\phi)(\bar{y}_1 + \bar{y}_{2m} - \bar{y}_{1m}) + \phi\,\bar{y}_{2u}$$

(a) Ignoring the fpc, show that

$$V(\bar{y}_2'') = \frac{S^2}{n}\left\{\frac{(1-\phi)^2[1 + \mu(1-2\rho)]}{\lambda} + \frac{\phi^2}{\mu}\right\}$$

where $\lambda = m/n$, $\mu = u/n$. (b) For given $\rho$, $\lambda$, $\mu$, find the value of $\phi$ that minimizes
$V(\bar{y}_2'')$. Show that if $\rho$ exceeds 1/2 the best weight $\phi$ lies between $\mu$ and $\mu/(1 + \mu)$.

12.8 For $\mu$ = 1/4, 1/2, $\rho$ = 0.8, and $\rho$ = 0.9, compare $V(\bar{y}_2'')$ in the
preceding exercise with the variance of the optimum composite regression estimate
$\bar{y}_2'$, as given by equation 12.35. (In $\bar{y}_2''$ take $\phi$ = 0.2 when $\mu$ = 1/4 and $\phi$ = 0.4
when $\mu$ = 1/2.) Verify that for these values of $\rho$ the estimate $\bar{y}_2''$ is almost as
precise as $\bar{y}_2'$ for both $\mu$ = 1/4 and $\mu$ = 1/2.

12.9 An independent sample of size $n$ is drawn each month. From the sample
taken in any month, data are obtained for the current and the preceding month.
A composite estimate $\bar{y}_h'$ is made as in (12.50), section 12.12.

$$\bar{y}_h' = \bar{y}_h + \phi_h(\bar{y}_{h-1}' - \bar{y}_{h-1,h})$$

The model is

$$y_{hi} = \bar{Y}_h + \rho(y_{h-1,i} - \bar{Y}_{h-1}) + e_{hi}$$

where $e_{hi}$ is independent of the y's and has variance $S^2(1 - \rho^2)$. Show that

(a)
$$\bar{y}_h' - \bar{Y}_h = \bar{e}_h + \phi_h(\bar{y}_{h-1}' - \bar{Y}_{h-1}) + (\rho - \phi_h)(\bar{y}_{h-1,h} - \bar{Y}_{h-1})$$

(b) If $V(\bar{y}_h') = g_h S^2/n$, where $S^2$ is constant on all occasions,
$$g_h = (1 - \rho^2) + \phi_h^2 g_{h-1} + (\rho - \phi_h)^2$$

(c) The optimum $\phi_h = \rho/(1 + g_{h-1})$ and the resulting optimum $g_h$ is
$$g_h = 1 - \frac{\rho^2}{1 + g_{h-1}}$$

(d) The limiting $g_h$ is $g = \sqrt{1 - \rho^2}$. These results were given by Eckler (1955).
CHAPTER 13


Sources of Error in Surveys 


13.1 INTRODUCTION 


The theory presented in preceding chapters assumes throughout that 
some kind of probability sampling is used and that the observation y; on 
the ith unit is the correct value for that unit. The error of estimate arises 
solely from the random sampling variation that is present when n of the 
units are measured instead of the complete population of N units. 

These assumptions hold reasonably well in the simpler types of surveys 
in which the measuring devices are accurate and the quality of work is 
high. In complex surveys, particularly when difficult problems of measure- 
ment are involved, the assumptions may be far from true. Three additional 
sources of error that may be present are as follows: 


1. Failure to measure some of the units in the chosen sample. This may 
occur by oversight, or, with human populations, because of failure to locate 
some individuals or their refusal to answer the questions when located. 

2. Errors of measurement on a unit. The measuring device may be 
biased or imprecise. With human populations the respondents may not 
possess accurate information or they may give biased answers. 

3. Errors introduced in editing, coding, and tabulating the results. 


These sources of error necessitate a modification of the standard theory 
of sampling. The principal aims of such a modification are to provide 
guidance about the allocation of resources between the reduction of 
random sampling errors and the reduction of the other errors and to 
develop methods for computing standard errors and confidence limits 
that remain valid when the other errors are present. 


13.2 EFFECTS OF NONRESPONSE


We shall use the term nonresponse to refer to the failure to measure
some of the units in the selected sample. In the study of nonresponse it is
convenient to think of the population as divided into two "strata," the
first consisting of all units for which measurements would be obtained
if the units happened to fall in the sample, the second of the units for
which no measurements would be obtained. The compositions of the two
strata depend intimately on the methods used to find the units and obtain
the data. A survey in which at least three calls are made, if necessary, on
every house and in which a supervisor with exceptional powers of
persuasion calls on all persons who refuse to give data will have a much
smaller "nonresponse" stratum than one in which only a single attempt
is made for every house.

This division into two distinct strata is, of course, an oversimplification.
Chance plays a part in determining whether a unit is found and measured
in a given number of attempts. In a more complete specification of the
problem we would attach to each unit a probability representing the
chance that it would be measured by a given field method if it fell in the
sample.

The sample provides no information about the nonresponse stratum 2.
This would not matter if it could be assumed that the characteristics of
stratum 2 are the same as those of stratum 1. Where checks have been
made, however, it has often been found that units in the "nonresponse"
stratum differ from units that are measurable. An illustration appears in
Table 13.1. The data come from an experimental sampling of fruit
orchards in North Carolina in 1946. Three successive mailings of the
same questionnaire were sent to growers. For one of the questions, the
number of fruit trees, complete data were available for the population
(Finkner, 1950).

TABLE 13.1
RESPONSES TO THREE REQUESTS IN A MAILED INQUIRY

                                   Number of   % of         Average Number of
                                   Growers     Population   Fruit Trees per Grower
Response to first mailing             300         10              456
Response to second mailing            543         17              382
Response to third mailing             434         14              340
Nonrespondents after 3 mailings      1839         59              290
Total population                     3116        100              329


The steady decline in the number of fruit trees per grower in the
successive responses is evident, these numbers being 456 for respondents
to the first mailing, 382 in the second mailing, 340 in the third, and 290
for the refusals to all three letters. The total response was poor, more than
half the population failing to give data even after three attempts.

We now consider the effects of nonresponse on the sample estimate.
Let $N_1$, $N_2$ be the numbers of units in the two strata and let $W_1 = N_1/N$,
$W_2 = N_2/N$, so that $W_2$ is the proportion of nonresponse in the population.
Assume that a simple random sample is drawn from the population.
When the field work is completed, we have data for a simple random
sample from stratum 1 but no data from stratum 2. Hence the amount
of bias in the sample mean is

$$E(\bar{y}_1) - \bar{Y} = \bar{Y}_1 - \bar{Y} = \bar{Y}_1 - (W_1\bar{Y}_1 + W_2\bar{Y}_2) = W_2(\bar{Y}_1 - \bar{Y}_2) \qquad (13.1)$$

The amount of bias is the product of the proportion of nonresponse
and the difference between the means in the two strata. Since the sample
provides no information about $\bar{Y}_2$, the size of the bias is unknown unless
bounds can be placed on $\bar{Y}_2$ from some source other than the sample data.
With a continuous variate, the only bounds that can be assigned with
certainty are often so wide as to be useless.
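The bias formula (13.1) can be illustrated with the figures of Table 13.1, treating the growers who answered any of the three mailings as stratum 1 and the remaining nonrespondents as stratum 2. This is a sketch with hypothetical variable names; the counts and means are those quoted for the North Carolina inquiry.

```python
# (number of growers, average number of fruit trees per grower)
respondents = [(300, 456), (543, 382), (434, 340)]   # three mailings
nonrespondents = (1839, 290)

n1 = sum(count for count, _ in respondents)
n2 = nonrespondents[0]
total = n1 + n2

w2 = n2 / total                                              # proportion of nonresponse
y1_bar = sum(count * mean for count, mean in respondents) / n1   # mean of stratum 1
y2_bar = nonrespondents[1]                                   # mean of stratum 2
pop_mean = (n1 * y1_bar + n2 * y2_bar) / total

bias = w2 * (y1_bar - y2_bar)                                # equation (13.1)
print(f"W2 = {w2:.3f}, respondent mean = {y1_bar:.1f}, "
      f"population mean = {pop_mean:.1f}, bias = {bias:.1f}")
```

The respondent mean is about 385 trees against a population mean of 329, so an estimate based on respondents alone would be too high by about 56 trees per grower.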

Consequently, with continuous data, any sizable proportion of
nonresponse usually makes it impossible to assign useful confidence limits
to $\bar{Y}$ from the sample results. We are left in the position of relying on
some guess about the size of the bias, without data to substantiate the
guess.

In sampling for proportions the situation is a little easier, since the
unknown proportion $P_2$ in stratum 2 must lie between 0 and 1. If $W_2$ is
known, these bounds for $P_2$ enable us to construct confidence limits for
the population proportion $P$. Suppose that a simple random sample of
$n$ units is drawn and that measurements are obtained for $n_1$ of the units
in the sample. Assuming $n_1$ large enough, 95% confidence limits for $P_1$
are given by

$$p_1 \pm 2\sqrt{p_1 q_1/n_1}$$

where $p_1$ is the sample proportion and the fpc is ignored.

When we try to derive a confidence statement about $P$, we are on safe
ground if we assume $P_2 = 0$ when finding $\hat{P}_L$ and $P_2 = 1$ when finding
$\hat{P}_U$. Thus we might take, for 95% limits,

$$\hat{P}_L = W_1\left(p_1 - 2\sqrt{p_1 q_1/n_1}\right) + W_2(0) \qquad (13.2)$$

$$\hat{P}_U = W_1\left(p_1 + 2\sqrt{p_1 q_1/n_1}\right) + W_2(1) \qquad (13.3)$$

It is easy to verify that these limits are conservative, that is, that

$$\Pr(\hat{P}_L \le P \le \hat{P}_U) > 0.95$$

The limits can be narrowed a little by a more careful argument (Cochran,
Mosteller, and Tukey, 1954), since $P_2$ cannot be 0 and 1 simultaneously,
as assumed above.

The limits are distressingly wide unless $W_2$ is very small. Table 13.2
shows the average limits for a sample size $n$ = 1000 and a series of values
of $W_2$ and $p_1$. Since the limits in (13.2) and (13.3) depend on the value of
$n_1$ (number of respondents in the sample), we have taken $n_1 = nW_1$, its
average value, in computing Table 13.2.

TABLE 13.2
95% CONFIDENCE LIMITS FOR P (%) WHEN n = 1000

% Nonresponse,              Sample Percentage, 100p_1
100W_2            5              10              20              50
   0         (3.6, 6.4)     (8.1, 11.9)    (17.5, 22.5)    (46.7, 53.2)
   5         (3.4, 11.1)    (7.6, 16.3)    (16.5, 26.5)    (44.4, 55.6)
  10         (3.2, 15.8)    (7.2, 20.8)    (15.6, 30.4)    (42.0, 58.0)
  15         (3.0, 20.5)    (6.8, 25.5)    (14.7, 34.3)    (39.6, 60.4)
  20         (2.8, 25.2)    (6.3, 29.7)    (13.7, 38.3)    (37.2, 62.8)


The seriousness of nonresponse can be judged by asking what sample
size would be needed to give the same width of confidence interval if
$W_2$ were zero. This is easily done when $p_1$ is 50%. For $W_2$ = 5%, Table 13.2
shows that the half-width of the confidence interval is 5.6. The equivalent
sample size $n_e$, assuming no nonresponse, is found from the equation

$$5.6 = 2\sqrt{(50)(50)/n_e}$$

$$n_e = 320$$

For $W_2$ = 10, 15, and 20% the values of $n_e$ are 155, 90, and 60,
respectively. It is evidently worthwhile to devote a substantial proportion
of the resources to the reduction of nonresponse.
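The conservative limits and the equivalent sample size can be computed together. The sketch below assumes the limits $W_1(p_1 \mp 2\sqrt{p_1 q_1/n_1})$ with $W_2$ added to the upper limit, and $n_1 = nW_1$ as in Table 13.2; the function names are hypothetical.

```python
import math


def conservative_limits(p1, w2, n, z=2.0):
    """Lower and upper limits for P, taking P2 = 0 and P2 = 1 in turn."""
    w1 = 1 - w2
    n1 = n * w1                       # average number of respondents
    half = z * math.sqrt(p1 * (1 - p1) / n1)
    lower = w1 * (p1 - half)
    upper = w1 * (p1 + half) + w2
    return lower, upper


def equivalent_sample_size(p1, w2, n, z=2.0):
    """Sample size with no nonresponse that gives the same half-width."""
    lower, upper = conservative_limits(p1, w2, n, z)
    half_width = (upper - lower) / 2
    return z**2 * p1 * (1 - p1) / half_width**2


lower, upper = conservative_limits(0.5, 0.05, 1000)
print(f"limits: ({100 * lower:.1f}, {100 * upper:.1f})")
for w2 in (0.05, 0.10, 0.15, 0.20):
    print(f"W2 = {w2:.2f}: equivalent n is about {equivalent_sample_size(0.5, w2, 1000):.0f}")
```

The equivalent sizes come out close to the rounded values 320, 155, 90, and 60; the small differences arise because the text works from half-widths rounded to one decimal place.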
An interesting method of finding sample size when nonresponse is
present has been given by Birnbaum and Sirken (1950a, 1950b). The
proportion of nonresponse $W_2$ is assumed known from previous experience.
With no nonresponse, the sample size needed to estimate a proportion $P$
with an error not exceeding $d$ is

$$n = \frac{t_\alpha^2 PQ}{d^2}$$

where $t_\alpha$ is the normal deviate corresponding to the risk $\alpha$ that the error
exceeds $d$. With no advance information about $P$, we would take $P$ = 0.5
as the least favorable case, giving

$$n = \frac{t_\alpha^2}{4d^2} \qquad (13.4)$$

By taking the least favorable combination of the bias $W_2(P_1 - P_2)$ and
the value of $P_1$, Birnbaum and Sirken show that a value of $n$ which still
guarantees an error less than $d$, with risk $\alpha$, is

$$n = \frac{t_\alpha^2}{4d(d - W_2)} - 1 \qquad (13.5)$$

Note that no value of $n$ suffices if $W_2 > d$. If $W_2 = 0$, this equation
reduces to (13.4) apart from the term $-1$, which comes from an
approximation in the analysis. Some values of $n$ given by Birnbaum and
Sirken's method are shown in Table 13.3.

TABLE 13.3
SMALLEST VALUE OF n FOR GIVEN LIMIT OF ERROR d, WITH RISK α = 0.05

% Nonresponse,             d (%)
100W_2            20      15      10       5
   0              24      43      96     384
   2              27      50     122     653
   4              31      60     166    2000
   6              36      75     255
   8              43      99     521
  10              53     142
  15             112
This table tells the same sad story as Table 13.2. If we are content with
a crude estimate ($d$ = 20), amounts of nonresponse up to 10% can be
handled by doubling the sample size. However, any sizable percentage
of nonresponse makes it impossible or very costly to attain a highly
guaranteed precision by increasing the sample size among the respondents.
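The behaviour of the sample size as $W_2$ approaches $d$ can be seen from the approximate formula. The sketch below assumes (13.5) in the form $n = t_\alpha^2/[4d(d - W_2)] - 1$ with $t_\alpha$ = 1.96; the function name is hypothetical, the case $W_2 \ge d$ is reported as infinite, and the results are approximations that need not agree with the entries of Table 13.3 to the last digit.

```python
import math


def birnbaum_sirken_n(d, w2, t=1.96):
    """Approximate sample size from (13.5); infinite when W2 is not below d."""
    if w2 >= d:
        return math.inf               # no sample size can guarantee the error limit
    return t**2 / (4 * d * (d - w2)) - 1


for d in (0.20, 0.15, 0.10, 0.05):
    sizes = [birnbaum_sirken_n(d, w2) for w2 in (0.0, 0.02, 0.04, 0.06)]
    print(d, ["inf" if math.isinf(n) else round(n) for n in sizes])
```

The required size grows without limit as the nonresponse rate approaches the permitted error, which is the point made by the blank cells of the table.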


13.3 TYPES OF NONRESPONSE


Some methods for handling the nonresponse problem are described in
succeeding sections. A rough classification of the types of nonresponse
is as follows.

1. Noncoverage. Failure to locate or to visit some units in the sample.

This is a problem with areal sampling units, in which the interviewer 
must find and list all dwellings (according to some definition) in a city 
block. It arises also from the use of incomplete lists. Sometimes weather 
or poor transportation facilities make it impossible to reach certain units 
during the period of the survey. 

2. Not-at-homes. This group contains persons who reside at home 
but are temporarily away from the house. Families in which both parents 
work and families without children are harder to reach than families with 
very young children or with old people confined to the house.

3. Unable to answer. The respondent may not have the information 
wanted in certain questions or may be unwilling to give it. Skillful wording 
and pretesting of the questionnaire are a safeguard. 

4. The “hard core.” Persons who adamantly refuse to be interviewed, 
who are incapacitated, or who are far from home during the whole time 
available for field-work constitute this sector. It represents a source of bias 
that persists no matter how much effort is put into completeness of returns. 


The detection and measurement of noncoverage are difficult. With 
areal sampling, 


listing that ser 


of which is to sample parts of the town 
covered by the directory and 


one person per family is uneconomical. In this connection, a useful 
method of selecting a single 


for smaller or larger households. 


The interviewer lists on the schedule the eligible persons in the household
and then numbers them: males first in order of decreasing age, then
females in order of decreasing age. Each schedule has printed on it one of
the sets of instructions in Table 13.4.

TABLE 13.4
INSTRUCTIONS FOR SELECTING A SINGLE RESPONDENT

Relative                   If Number of Adults in Household is
Frequency     Table        1     2     3     4     5     6
of Use        Number             Select Adult Numbered
1/6           A            1     1     1     1     1     1
1/12          B1           1     1     1     1     2     2
1/12          B2           1     1     1     2     2     2
1/6           C            1     1     2     2     3     3
1/6           D            1     2     2     3     4     4
1/12          E1           1     2     3     3     3     5
1/12          E2           1     2     3     4     5     5
1/6           F            1     2     3     4     5     6

Each eligible person in a household of a given size has an equal chance
of being selected, except that adults 3 and 5 in households of size 5 are
slightly overrepresented. Since male respondents are concentrated in
Tables A, B, and C, the interviewer can devote evening calls to households
so designated.
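The claim about equal chances can be verified by adding up the relative frequencies of the instruction sets that select each adult. The sketch below encodes the eight instruction sets of Table 13.4 as they are usually given for this selection scheme; the entries are an assumption wherever the printed table is illegible, and the names are hypothetical.

```python
from fractions import Fraction

# Instruction set -> (relative frequency of use, adult selected when the
# household has 1, 2, 3, 4, 5, or 6 adults). Assumed reading of Table 13.4.
SELECTION_TABLES = {
    "A":  (Fraction(1, 6),  [1, 1, 1, 1, 1, 1]),
    "B1": (Fraction(1, 12), [1, 1, 1, 1, 2, 2]),
    "B2": (Fraction(1, 12), [1, 1, 1, 2, 2, 2]),
    "C":  (Fraction(1, 6),  [1, 1, 2, 2, 3, 3]),
    "D":  (Fraction(1, 6),  [1, 2, 2, 3, 4, 4]),
    "E1": (Fraction(1, 12), [1, 2, 3, 3, 3, 5]),
    "E2": (Fraction(1, 12), [1, 2, 3, 4, 5, 5]),
    "F":  (Fraction(1, 6),  [1, 2, 3, 4, 5, 6]),
}


def selection_probabilities(num_adults):
    """Probability that each adult 1..num_adults is the one selected."""
    probs = [Fraction(0)] * num_adults
    for frequency, picks in SELECTION_TABLES.values():
        chosen = picks[num_adults - 1]
        probs[chosen - 1] += frequency
    return probs


for size in range(1, 7):
    print(size, [str(p) for p in selection_probabilities(size)])
```

Households of every size except 5 give each adult a chance of exactly one in the household size. In households of 5, adults 3 and 5 are selected with probability 1/4 and the others with probability 1/6, which is the slight overrepresentation noted in the text.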
13.4 CALL-BACKS 


A standard technique is to specify the number of call-backs, or a
minimum number, that must be made on any unit before abandoning it
as "unable to contact." Stephan and McCarthy (1958) give data from a
number of surveys on the percentage of the total sample obtained at each
call. Average results are shown in Table 13.5.

TABLE 13.5
NUMBER OF CALLS REQUIRED FOR COMPLETED INTERVIEWS

                        % of Sample Contacted on
                 First    Second    Third or      Per Cent
Respondent       Call     Call      Later Call    Nonresponse    Total
Any adult*        70       17          8              5           100
Random adult      37       32         23              8           100

* Two surveys in which the respondent was a housewife and a farm operator,
respectively, have been included in the "any adult" group.

In surveys in which any adult in the house could answer the questions,
the first call obtained about 70% of the sample and the first two calls
87%. The increased cost of sampling when a randomly chosen adult is
to be interviewed is evident, the first call producing only 37% of the
required interviews. The marked success of the second call reflects the
work of the interviewer in finding out in advance when the desired
respondent would be at home and available.

Little has been published on the relative costs of later calls to the first
call. Later calls would be expected to be more expensive per completed
interview, since the houses are more sparsely located in the area assigned
to the interviewer and since the occupants are presumably people who
spend more than an average amount of time away from home. From
British experience, Durbin (1954) suggests that later calls may be less
expensive than would be anticipated. The following figures show
estimated relative costs per completed interview (i.e., money spent on ith
calls divided by number of new interviews obtained) for each call up to
the fifth in a special study reported by Durbin and Stuart (1954).

TABLE 13.6
RELATIVE COSTS PER NEW COMPLETED INTERVIEW AT THE ith CALL

Call              1      2      3      4      5
Relative cost    100    112    127    151    250


The estimation of these costs requires care. If the desired respondent
is not at home at the first call, the interviewer may spend time inquiring
when this person will be at home and making a tentative appointment.
In the costing such time should be assigned to the second call rather than
to an unsuccessful first call.

Of more interest is the average cost per completed interview for all
interviews obtained up to the ith call. These figures give the relative
costs per completed interview when we insist on $i$ calls before
abandoning a unit. In order to compute these figures, we must know how
many interviews are obtained at each call. In Table 13.7 these
calculations are made under two sets of assumptions. The first simulates
surveys in which any adult can answer the questions, the second those
demanding a random adult. The data on numbers of interviews obtained
were taken from Table 13.5.

The details of the calculation are shown only for the first case, the
method being exactly the same for the second. The symbol $n_0$ denotes
the original sample size.

TABLE 13.7
RELATIVE COSTS PER COMPLETED INTERVIEW UP TO THE ith CALL

                      Respondent = Any Adult                              "Random" Adult
                  At ith Call             Up to ith Call
       Relative   No. of     Cost of     Total No.   Total       Cost per   No. of     Cost per
Call   Cost       Ints.*     Ints.       of Ints.    Cost        Int.       Ints.      Int.
1      100        0.70n_0    70n_0       0.70n_0     70n_0       100        0.37n_0    100
2      112        0.17n_0    19.04n_0    0.87n_0     89.04n_0    102        0.32n_0    106
3      127        0.07n_0    8.89n_0     0.94n_0     97.93n_0    104        0.16n_0    110
4      151        0.04n_0    6.04n_0     0.98n_0     103.97n_0   106        0.09n_0    114
5      250        0.02n_0    5.00n_0     1.00n_0     108.97n_0   109        0.06n_0    122

* Interviews

Insistence on up to three calls costs only 4% more per completed
interview than single calls if any adult is a satisfactory respondent, and
only 10% more if a random adult must be interviewed. How typical these
results are is not known, but the method provides realistic estimates of
the cost of insisting on call-backs if the necessary cost and sample size
data have been collected. There is also the time-factor: call-backs delay
the final results.
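The running averages in Table 13.7 follow from the relative costs per call and the fraction of the sample interviewed at each call. The sketch below reproduces both "cost per interview" columns; the function name is hypothetical, and the sample size $n_0$ cancels, so fractions of the sample are used directly.

```python
def cumulative_cost_per_interview(relative_costs, fractions):
    """Average cost per completed interview when calls 1..i are made."""
    results = []
    total_cost = 0.0
    total_interviews = 0.0
    for cost, fraction in zip(relative_costs, fractions):
        total_cost += cost * fraction          # cost of interviews won at this call
        total_interviews += fraction           # interviews completed so far
        results.append(total_cost / total_interviews)
    return results


RELATIVE_COSTS = [100, 112, 127, 151, 250]          # cost per new interview, by call
ANY_ADULT = [0.70, 0.17, 0.07, 0.04, 0.02]          # fraction interviewed at each call
RANDOM_ADULT = [0.37, 0.32, 0.16, 0.09, 0.06]

any_adult_costs = [round(c) for c in cumulative_cost_per_interview(RELATIVE_COSTS, ANY_ADULT)]
random_adult_costs = [round(c) for c in cumulative_cost_per_interview(RELATIVE_COSTS, RANDOM_ADULT)]
print(any_adult_costs)
print(random_adult_costs)
```

The third entries, 104 and 110, are the 4% and 10% increases that the text attributes to insisting on up to three calls.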


13.5 A MATHEMATICAL MODEL OF THE EFFECTS 
OF CALL-BACKS 


Deming (1953) developed a useful and flexible mathematical model for
examining in more detail the consequences of different call-back policies.
The population is divided into $r$ classes, according to the probability that
the respondent will be found at home. Let

$w_{ij}$ = probability that a respondent in the jth class will be reached on or
before the ith call
$p_j$ = proportion of the population falling in the jth class
$\mu_j$ = item mean for the jth class
$\sigma_j^2$ = item variance for the jth class

For simplicity we assume $w_{ij} > 0$ for all classes, though the method is
easily adapted to include persons impossible to reach. If $\bar{y}_{ij}$ is the mean
for those in class $j$ who were reached on or before the ith call, it is also
assumed that $E(\bar{y}_{ij}) = \mu_j$.

The true population mean for the item is

$$\bar{\mu} = \sum_j p_j \mu_j \qquad (13.6)$$

Consider the composition of the sample after $i$ calls. The persons in the
sample can be classified into $(r + 1)$ classes as follows: in the first class
and interviewed; in the second class and interviewed; and so on. The
$(r + 1)$th class consists of all those not yet interviewed after $i$ calls. If the
fpc is ignored, the numbers falling in these $(r + 1)$ classes are distributed
according to the multinomial

$$\left[w_{i1}p_1 + w_{i2}p_2 + \cdots + w_{ir}p_r + \left(1 - \sum_j w_{ij}p_j\right)\right]^{n_0}$$

where $n_0$ is the initial size of the sample.


It follows that the number $n_i$ who have been interviewed in the course
of i calls is binomially distributed with number of trials = $n_0$ and
probability of success $\sum_j w_{ij}p_j$. Hence

$$E(n_i) = \text{expected number of interviews in } i \text{ calls} = n_0 \sum_{j=1}^{r} w_{ij}p_j \qquad (13.7)$$


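For fixed $n_i$, the numbers of interviews $n_{ij}$ obtained (j = 1, 2, ..., r) follow
a multinomial with probabilities $w_{ij}p_j / \sum_j w_{ij}p_j$. It follows that

$$E(n_{ij} \mid n_i) = \frac{n_i w_{ij}p_j}{\sum_j w_{ij}p_j}$$

Hence, if $\bar{y}_i$ is the sample mean obtained after i calls,

$$E(\bar{y}_i \mid n_i) = E\left(\frac{\sum_j n_{ij}\bar{y}_{ij}}{n_i}\right) = \frac{\sum_j w_{ij}p_j\mu_j}{\sum_j w_{ij}p_j} = \bar{\mu}_i \qquad (13.8)$$

Since this result does not depend on $n_i$, the unconditional mean of $\bar{y}_i$ is
also $\bar{\mu}_i$. The bias in the estimate $\bar{y}_i$ is therefore $(\bar{\mu}_i - \bar{\mu})$.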


The conditional variance of $\bar{y}_i$ for given $n_i$ is found similarly to be

$$V(\bar{y}_i \mid n_i) = \frac{\sum_j w_{ij}p_j[\sigma_j^2 + (\mu_j - \bar{\mu}_i)^2]}{n_i \sum_j w_{ij}p_j} \qquad (13.9)$$

The unconditional variance, ignoring terms of order $1/n_0^2$, is given
approximately by replacing $n_i$ in (13.9) by its expected value from (13.7).
Finally, the mean square error of the estimate obtained after i calls is

$$MSE(\bar{y}_i \mid i) = V(\bar{y}_i \mid i) + (\bar{\mu}_i - \bar{\mu})^2 \qquad (13.10)$$


The cost of making i calls must also be considered. The expected
number of new interviews obtained in the kth call is $n_0 \sum_j (w_{kj} - w_{k-1,j})p_j$.
Hence, if $c_k$ is the cost per completed interview at the kth call, the expected
total cost of making i calls is $n_0 C(i)$, where

$$C(i) = c_1 \sum_j w_{1j}p_j + c_2 \sum_j (w_{2j} - w_{1j})p_j + \cdots + c_i \sum_j (w_{ij} - w_{i-1,j})p_j$$


Example. A population with three classes is shown in Table 13.8. The $p_j$
and $w_{ij}$ are intended to represent surveys in which a random adult is interviewed.
At the first call the probabilities $w_{1j}$ of obtaining an interview are taken as 0.6,
0.3, and 0.1 in the three classes. At the second and subsequent calls, the
conditional probabilities of interviewing a person missed previously are 0.9, 0.5, and
0.2. These figures were made higher than the corresponding probabilities at the
first call in order to represent the effect of intelligent inquiry by the interviewer.


TABLE 13.8
CHARACTERISTICS OF THE THREE CLASSES

Class    p_j     w_ij                                mu_j (I)    mu_j (II)
1        0.45    0.6 + (0.4)[1 - (0.1)^(i-1)]        55          60
2        0.50    0.3 + (0.7)[1 - (0.5)^(i-1)]        50          50
3        0.05    0.1 + (0.9)[1 - (0.8)^(i-1)]        45          40


TABLE 13.9
NUMBER OF INTERVIEWS, COSTS PER INTERVIEW AND BIASES

Number of Calls    Number of Interviews    Average Cost       I          II
Required           Obtained                per Interview      Bias       Bias
1                  0.425n_0                100                +1.118     +2.235
2                  0.771n_0                105                +0.711     +1.421
3                  0.882n_0                108                +0.421     +0.842
4                  0.933n_0                110                +0.266     +0.532
5                  0.960n_0                114                +0.180     +0.360


The item being estimated is a binomial percentage close to 50%. Two sets of
$\mu_j$ are considered (I, II). For simplicity, the within-class variances $\sigma_j^2 =
\mu_j(100 - \mu_j)$ were all taken as 2500. The relative costs per completed interview
at successive calls were those given in Table 13.6.

Table 13.9 shows (a) the expected total number of interviews obtained for a total
of i calls, (b) the average cost of these calls per interview, and (c) the bias
$(\bar{\mu}_i - \bar{\mu})$ in the estimate $\bar{y}_i$ under assumptions I and II about the $\mu_j$.
In II, for example, the true population mean $\bar{\mu}$ is 54%. The mean $\bar{\mu}_1$ obtained
from first calls is 56.235%, giving the bias of +2.235% shown in the table. A
policy that requires three calls reduces this bias to +0.842%.
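The entries of Table 13.9 can be recomputed from (13.6) to (13.8) and the cost function C(i). The sketch below is only an illustration: the class proportions (0.45, 0.50, 0.05) and the class means under I and II are assumptions inferred from the printed totals and biases, and the per-call costs 100, 112, 127, 151, 250 are read from the cost column of Table 13.6. All function names are hypothetical.

```python
# Assumed class parameters (inferred from Table 13.9, not quoted directly).
P = [0.45, 0.50, 0.05]            # p_j, proportion of population in class j
FIRST = [0.6, 0.3, 0.1]           # w_1j, chance of an interview at call 1
LATER = [0.9, 0.5, 0.2]           # chance at each later call, if missed so far
MU = {"I": [55, 50, 45], "II": [60, 50, 40]}   # item means mu_j (percent)
COST = [100, 112, 127, 151, 250]  # c_k, relative cost per interview at call k


def w(i, j):
    """w_ij: probability that class j is reached on or before call i (i >= 1)."""
    return FIRST[j] + (1 - FIRST[j]) * (1 - (1 - LATER[j]) ** (i - 1))


def reached(i):
    """Sum over j of w_ij * p_j, that is E(n_i)/n_0 from (13.7)."""
    return sum(w(i, j) * P[j] for j in range(3))


def bias(i, mu):
    """Bias of the mean after i calls, using (13.8) and (13.6)."""
    mean_i = sum(w(i, j) * P[j] * mu[j] for j in range(3)) / reached(i)
    true_mean = sum(p * m for p, m in zip(P, mu))
    return mean_i - true_mean


def avg_cost(i):
    """Expected cost per completed interview over the first i calls."""
    total = COST[0] * reached(1)
    for k in range(2, i + 1):
        total += COST[k - 1] * (reached(k) - reached(k - 1))
    return total / reached(i)


for i in range(1, 6):
    print(i, round(reached(i), 3), round(avg_cost(i)),
          round(bias(i, MU["I"]), 3), round(bias(i, MU["II"]), 3))
```

Each printed row corresponds to one line of Table 13.9: calls required, fraction of $n_0$ interviewed, average cost per interview, and the two biases.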

The values of MSE($\bar{y}_i$) obtained from a given expenditure of money are
compared for the different call-back policies. In the first comparison the
amount of money is sufficient to take $n_0$ = 500 if only one call is made. From
Table 13.9 the expected number of interviews obtained in the first call is $E(n_1)$ =
(500)(0.425) = 212.5. If two calls are made, this expected number must be
reduced to $E(n_2)$ = 212.5/1.05 = 202.4 to maintain the same cost, and similarly
for 3, 4, and 5 call-backs. These values of $E(n_i)$ were substituted in equation
(13.9) to give V($\bar{y}_i$) and hence MSE($\bar{y}_i$).
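The fixed-budget comparison can be sketched in a few lines. As an illustration only, the code below scales the initial sample so that every policy costs the same as $n_0$ single calls, then applies (13.9) with $n_i$ replaced by $E(n_i)$ and adds the squared bias as in (13.10). The class proportions and class means are assumed values inferred from Table 13.9, and unrounded cost ratios are used, so the last digit can differ from a hand calculation with rounded costs.

```python
# Assumed inputs: P and the MU lists are inferred values; FIRST, LATER, COST
# and SIGMA2 are the figures quoted in the example.
P = [0.45, 0.50, 0.05]
FIRST = [0.6, 0.3, 0.1]
LATER = [0.9, 0.5, 0.2]
COST = [100, 112, 127, 151, 250]
SIGMA2 = 2500.0
MU = {"none": [50, 50, 50], "I": [55, 50, 45], "II": [60, 50, 40]}


def wp(i):
    """List of w_ij * p_j for call i."""
    return [(first + (1 - first) * (1 - (1 - later) ** (i - 1))) * p
            for first, later, p in zip(FIRST, LATER, P)]


def cost_per_initial_unit(i):
    """C(i): expected cost of i calls per unit of the initial sample."""
    total, previous = 0.0, 0.0
    for k in range(1, i + 1):
        reached = sum(wp(k))
        total += COST[k - 1] * (reached - previous)
        previous = reached
    return total


def mse(i, mu, n0_single):
    """MSE after i calls when the budget buys n0_single first calls only."""
    weights = wp(i)
    reached = sum(weights)
    n0 = n0_single * cost_per_initial_unit(1) / cost_per_initial_unit(i)
    expected_n = n0 * reached                      # E(n_i), from (13.7)
    mean_i = sum(a * m for a, m in zip(weights, mu)) / reached
    true_mean = sum(p * m for p, m in zip(P, mu))
    spread = sum(a * (SIGMA2 + (m - mean_i) ** 2) for a, m in zip(weights, mu))
    variance = spread / (expected_n * reached)     # (13.9) with n_i -> E(n_i)
    return variance + (mean_i - true_mean) ** 2    # (13.10)


for label in ("none", "I", "II"):
    print(label, [round(mse(i, MU[label], 500), 1) for i in range(1, 6)])
```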


TABLE 13.10
VALUES OF MSE($\bar{y}_i$) FOR DIFFERENT CALL-BACK POLICIES
COSTING THE SAME AMOUNT

                      n_0 = 500
Number of       (for first calls only)       n_0 = 1000       n_0 = 2000
Calls           No
Required        Bias     I*       II*         I       II       I       II

1               11.8     13.0     16.9        7.4     10.9     4.2     8.0
2               12.4     12.9     14.6        6.7      8.3     3.6     5.2
3               12.7     12.9     13.6        6.5      7.4     3.4     3.9
4               13.0     13.1     13.4        6.6      6.9     3.3     3.6
5               13.5     13.5     13.8        6.8      6.9     3.4     3.5


* These represent populations with smaller (I) and greater (II) amounts of
bias, as defined in Table 13.8.


Table 13.10 presents the resulting MSE's for three amounts of expenditure,
corresponding to $n_0$ = 500, 1000, 2000 for a single call. When $n_0$ = 500, the
values of MSE($\bar{y}_i$) are also given for the "no bias" situation in which every
$\mu_j$ = 50. This column shows the effect of call-backs when they are unnecessary,
since no bias results from confining the survey to a single call.

The policies giving the lowest MSE's are shown in boldface type. Consider
first the smallest sample size, $n_0$ = 500. If call-backs are unnecessary, a policy
demanding as many as four call-backs results in only a modest increase in the
MSE. In I, involving the smaller amount of bias, the different policies produce
about the same accuracy, although three is the optimum. In II, three to five
call-backs are satisfactory, a single call giving a MSE about 25% above the
minimum.

For the larger sample sizes the optimum number of call-backs increases to four
or five, and the use of a single call results in more substantial losses of accuracy.

This is, of course, only an illustration. The importance of the approach is
that as information accumulates about costs and relative biases an
economical policy can be worked out for any specific type of survey.


13.6 OPTIMUM SAMPLING FRACTION AMONG THE
NONRESPONDENTS


After the first attempt to reach the persons in the sample has been made, 
another approach, due to Hansen and Hurwitz (1946), is to take a random 
subsample of the persons who have not been reached and make a major 
effort to interview everyone in the subsample. This technique was first 
developed for surveys in which the initial attempt was made by mail, a 
subsample of persons who did not return the completed questionnaire 
being approached by the more expensive method of a personal interview. 

The first step is to take a simple random sample of n units, using the
ordinary field methods. Let $n_1$ be the number of units in the sample that
provide the data sought and $n_2$ the number in the nonresponse group.
By more intensive efforts, the data are later obtained from a random
sample of $r_2$ out of the $n_2$. Let

$$n_2 = kr_2 \quad (k > 1) \qquad (13.11)$$

Then the average sampling fraction in the first stratum is k times that in
the second. This follows because if k is fixed in advance

$$E\left(\frac{r_2}{N_2}\right) = \frac{1}{k}E\left(\frac{n_2}{N_2}\right) = \frac{1}{k}E\left(\frac{n_1}{N_1}\right)$$

The values of n (initial size of sample) and k are chosen to give a
specified precision for the lowest cost.
The cost of taking the sample is

$$C = c_0 n + c_1 n_1 + c_2 r_2$$

where the c's are costs per unit: $c_0$ is the cost of making the first attempt,
$c_1$ is the cost of processing the results from the first attempt, and $c_2$ is the
cost of getting and processing the data in the second stratum. Since
the values of $n_1$ and $n_2$ are not known until the first attempt is made, the
expected cost is used in planning the sample. The expected values of
$n_1$ and $r_2$ are, respectively, $W_1 n$ and $W_2 n/k$, where $W_1$, $W_2$ are the true
proportions in the two strata. Thus expected cost is

$$c_0 n + c_1 W_1 n + \frac{c_2 W_2 n}{k} \qquad (13.12)$$
Let $\bar{y}_1$, $\bar{y}_{2r}$ be the sample means in the two strata. The subscript r is
introduced as a reminder that the sample in the second stratum is of size
$r_2$. As an estimate of the population mean, we take

$$\bar{y}' = \frac{1}{n}(n_1\bar{y}_1 + n_2\bar{y}_{2r}) \qquad (13.13)$$

Note that the second stratum receives a weight $n_2$, although the sample
is only of size $r_2$. This is done in order to obtain an unbiased estimate.

This procedure is an application of double sampling with stratification.
The first or "large" sample of size n gives an estimate $n_1/n_2$ of the relative
size of the strata. The second or "small" sample is of size $n_1$ in the first
stratum and $r_2$ in the second stratum.

To find $V(\bar{y}')$, write

$$\bar{y}' = \frac{1}{n}(n_1\bar{y}_1 + n_2\bar{y}_{2n}) + \frac{n_2}{n}(\bar{y}_{2r} - \bar{y}_{2n}) \qquad (13.14)$$

where $\bar{y}_{2n}$ is the mean of the whole sample of size $n_2$ from stratum 2.
The first term on the right is the mean of a random sample of size n
from the whole population. Its variance is therefore

$$\frac{(N - n)}{N}\frac{S^2}{n}$$

where $S^2$ is the variance of the whole population. Further, when we find
the variance of $\bar{y}'$, there is no contribution from cross products between
the first and second terms. For

$$E(\bar{y}_{2r} - \bar{y}_{2n}) = 0$$

over all random samples of size $r_2$ that can be drawn from a fixed sample
of size $n_2$.

Consider the second term on the right of (13.14). If $\bar{Y}_2$ is the population
mean of the nonresponse stratum, we have

$$(\bar{y}_{2r} - \bar{Y}_2) = (\bar{y}_{2r} - \bar{y}_{2n}) + (\bar{y}_{2n} - \bar{Y}_2)$$

so that

$$E(\bar{y}_{2r} - \bar{Y}_2)^2 = E(\bar{y}_{2r} - \bar{y}_{2n})^2 + E(\bar{y}_{2n} - \bar{Y}_2)^2$$

there being no contribution from cross-product terms for the same reason
as before. Now $\bar{y}_{2r}$ is the mean of a simple random sample of size $r_2$ from
the second stratum, and $\bar{y}_{2n}$ is the mean of a simple random sample of
size $n_2$ from the same stratum. Hence, for fixed $n_2$ and $r_2$,

$$\frac{(N_2 - r_2)}{N_2}\frac{S_2^2}{r_2} = E(\bar{y}_{2r} - \bar{y}_{2n})^2 + \frac{(N_2 - n_2)}{N_2}\frac{S_2^2}{n_2}$$

where $S_2^2$ is the variance within the nonresponse stratum. This gives

$$E(\bar{y}_{2r} - \bar{y}_{2n})^2 = S_2^2\left(\frac{1}{r_2} - \frac{1}{n_2}\right) = S_2^2\,\frac{n_2 - r_2}{n_2 r_2} = S_2^2\,\frac{k - 1}{n_2}$$
since $n_2 = kr_2$.

Hence, adding the variances of the two terms in (13.14), we find, for
fixed $n_2$,

$$V(\bar{y}') = \frac{(N - n)}{N}\frac{S^2}{n} + \frac{(k - 1)n_2}{n^2}S_2^2 \qquad (13.15)$$

Since $E(n_2) = nW_2$, this gives for the expected variance

$$V(\bar{y}') = \frac{(N - n)}{N}\frac{S^2}{n} + \frac{(k - 1)W_2}{n}S_2^2 \qquad (13.16)$$

The first term is the variance that would be obtained if all $n_2$ in the
nonresponse group were sampled. The second term is the increase in
variance from sampling only $r_2$ of the $n_2$.

The quantities n and k are then chosen to minimize average cost (13.12) 
for a preassigned value of the expected variance (13.16). 

The solutions are

$$k_{opt} = \sqrt{\frac{c_2(S^2 - W_2S_2^2)}{S_2^2(c_0 + c_1W_1)}} \qquad (13.17)$$

$$n_{opt} = \frac{N[S^2 + (k - 1)W_2S_2^2]}{NV + S^2} \qquad (13.18)$$

where V is the value specified for the variance of the estimated population
mean.
The solutions require a knowledge of $W_2$; this can often be estimated
from previous experience. In addition to $S^2$, whose value must be
estimated in advance in any "sample size" problem, the solutions also
involve $S_2^2$, the variance in the nonresponse stratum. The value of $S_2^2$
is naturally harder to predict; it will probably not be the same as $S^2$.
For instance, in surveys made by mail of most kinds of economic
enterprise, the respondents tend to be larger operators, with larger between-unit
variances than the nonrespondents.

If $W_2$ is not well known, a satisfactory approximation is to work out
the value of $n_{opt}$ for a range of assumed values of $W_2$ between 0 and a
safe upper limit. The maximum $n_{opt}$ in this series is adopted as the initial
sample size n. When the replies to the mail survey have been received,
the value of $n_2$ is known. The variance formula (13.15) is then solved to
find the value of k that gives the desired variance V. The cost for this
method is usually only slightly higher than the optimum cost which would
have applied if $W_2$ were known.

Example. This example is condensed from the paper by Hansen and
Hurwitz (1946). The first sample is taken by mail and the response rate $W_1$ is
expected to be 50%. The precision desired is that which would be given by a
simple random sample of size 1000 if there were no nonresponse. The cost of
mailing a questionnaire is 10 cents, and the cost of processing the completed
questionnaire is 40 cents. To carry out a personal interview costs $4.10.

How many questionnaires should be sent out and what percentage of the
nonrespondents should be interviewed?

In terms of the cost function (13.12) the unit costs in dollars are as follows:

$c_0$ = cost of first attempt = 0.1
$c_1$ = cost of processing data for a respondent = 0.4
$c_2$ = cost of obtaining and processing data for a nonrespondent = 4.5

The optimum n and k can be found from (13.17) and (13.18). If the variances
$S^2$ and $S_2^2$ are assumed equal and N is assumed to be large, then

$$k_{opt} = \sqrt{\frac{c_2W_1}{c_0 + c_1W_1}} = \sqrt{\frac{(4.5)(0.5)}{0.1 + (0.4)(0.5)}} = \sqrt{7.5} = 2.739$$

$$n_{opt} = \frac{S^2[1 + (k - 1)W_2]}{V} = 1000\{1 + (1.739)(0.5)\} = 1870$$

Note that we have put $S^2/V$ = 1000, or V = $S^2$/1000, since this is the variance
that the sample mean would have if a sample of 1000 were taken and complete
response were obtained.

Consequently, 1870 questionnaires should be mailed. Of the 935 that are not
returned, we interview a random subsample of 935/2.739, or 341. The cost is
$2095.
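The arithmetic of this example follows directly from (13.17), (13.18), and the cost function. The sketch below is illustrative only: the function names are hypothetical, and setting $S^2 = S_2^2 = 1$ with V = 1/1000 is simply a convenient way to encode "equal variances" and "the precision of a complete-response sample of 1000".

```python
import math


def optimum_design(c0, c1, c2, w1, s2_all, s2_nonresp, v_target, big_n=None):
    """Return (k_opt, n_opt) from (13.17) and (13.18).

    big_n=None treats the population size N as effectively infinite.
    """
    w2 = 1 - w1
    k = math.sqrt(c2 * (s2_all - w2 * s2_nonresp)
                  / (s2_nonresp * (c0 + c1 * w1)))
    numerator = s2_all + (k - 1) * w2 * s2_nonresp
    if big_n is None:
        n = numerator / v_target
    else:
        n = big_n * numerator / (big_n * v_target + s2_all)
    return k, n


# Figures from the example: costs in dollars, 50% response expected.
k_opt, n_exact = optimum_design(0.1, 0.4, 4.5, 0.5, 1.0, 1.0, 1 / 1000)
n_mail = math.ceil(n_exact)              # questionnaires to mail
n_nonresp = n_mail // 2                  # expected nonrespondents at 50%
n_interview = round(n_nonresp / k_opt)   # subsample to interview
cost = 0.1 * n_mail + 0.4 * (n_mail - n_nonresp) + 4.5 * n_interview
print(round(k_opt, 3), n_mail, n_nonresp, n_interview, cost)
```

The cost line uses the rounded counts (1870 mailed, 935 processed, 341 interviewed), which is why it lands on the rounded dollar figure rather than on the expected cost (13.12) evaluated at the unrounded optimum.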


As Durbin (1954) has pointed out, subsampling is unlikely to show a
marked profit unless $c_2$ is large in relation to $(c_0 + c_1W_1)$. The two
quantities are comparable, since $(c_0 + c_1W_1)$ is the expected cost per unit
of making the first attempt and processing the results and $c_2$ has the same
meaning for the second attempt. From the equations it can be shown
that the ratio of the cost of obtaining a prescribed V with k = 1 (no
subsampling) to the minimum cost for optimum k is

$$\frac{S^2(c_0 + c_1W_1 + c_2W_2)}{\left[\sqrt{(S^2 - W_2S_2^2)(c_0 + c_1W_1)} + S_2W_2\sqrt{c_2}\right]^2} \doteq \frac{c_0 + c_1W_1 + c_2W_2}{\left[\sqrt{W_1(c_0 + c_1W_1)} + W_2\sqrt{c_2}\right]^2}$$

if $S^2$ and $S_2^2$ are approximately equal. If r is the ratio of $c_2$ to $(c_0 + c_1W_1)$,
the cost ratio becomes

$$\frac{1 + rW_2}{(\sqrt{W_1} + W_2\sqrt{r})^2}$$

For instance, if r = 4, the cost ratio is 1.029 for $W_1$ = 0.5, 1.074 for
$W_1$ = 0.8, and 1.061 for $W_1$ = 0.9. If, however, $S^2$ is substantially
greater than $S_2^2$, there is more to be gained from subsampling.
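The quoted cost ratios are easy to check numerically. This is a minimal sketch of the equal-variance cost ratio $(1 + rW_2)/(\sqrt{W_1} + W_2\sqrt{r})^2$; the function name is hypothetical, and the response rate $W_1$ is the input, with $W_2 = 1 - W_1$.

```python
import math


def cost_ratio(r, w1):
    """Cost with no subsampling (k = 1) divided by the minimum cost,
    assuming equal variances in the two strata and a large population."""
    w2 = 1 - w1
    return (1 + r * w2) / (math.sqrt(w1) + w2 * math.sqrt(r)) ** 2


for w1 in (0.5, 0.8, 0.9):
    print(w1, round(cost_ratio(4, w1), 3))
```

With r = 4 the saving from subsampling never exceeds a few percent in these cases, which is the point of Durbin's remark.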

With stratified sampling, the optimum values of the $n_h$ and the $k_h$ in
the individual strata are rather complex. A good approximation is to
estimate first, by the methods in sections 5.5 and 5.6, the sample sizes $n_{0h}$
that would be required in the strata if there were no nonresponse. Now
from (13.18), if $W_2$ = 0, we have

$$n_0 = \frac{NS^2}{NV + S^2}$$

Hence (13.18) can be rewritten as

$$n_{opt} = n_0\left[1 + \frac{(k - 1)W_2S_2^2}{S^2}\right]$$

This equation, applied separately to each stratum, gives an approximation
to the optimum $n_h$. The values of $k_h$ are found by applying (13.17)
in each stratum.

These techniques can be used with ratio or regression estimates. With
the ratio estimate, the quantities $S^2$ and $S_2^2$ are replaced by $S_d^2$ and $S_{2d}^2$,
where $d_i = y_i - Rx_i$. With a regression estimate, $S^2$ becomes $S^2(1 - \rho^2)$
and $S_2^2$ becomes $S_2^2(1 - \rho^2)$.


13.7 ADJUSTMENTS FOR BIAS WITHOUT CALL-BACKS


An ingenious method of diminishing the biases present in the results of 
the first call was suggested by Hartley (1946) and developed by Politz and 
Simmons (1949, 1950) and Simmons (1954). Suppose that all calls are 
made during the evening on the six week-nights. The respondent is asked 
whether he was at home, at the time of the interview, on each of the five 
preceding week-nights. If the respondent states that he was at home t
nights out of five, the ratio (t + 1)/6 is taken as an estimate of the
frequency $\pi$ with which he is at home during interviewing hours.

The results from the first call are sorted into six groups according to
the value of t (0, 1, 2, 3, 4, 5). In the tth group let $n_t$ be the number of
interviews obtained and $\bar{y}_t$ the item mean. The Politz-Simmons estimate
of the population mean $\bar{\mu}$ is

$$\bar{y}_{PS} = \frac{\sum_{t=0}^{5} 6n_t\bar{y}_t/(t + 1)}{\sum_{t=0}^{5} 6n_t/(t + 1)}$$

This approach recognizes that the first call results are unduly weighted
with persons who are at home most of the time. Since a person who is
at home, on the average, a proportion $\pi$ of the time has a relative chance
$\pi$ of appearing in the sample, his response should receive a weight $1/\pi$.
The quantity 6/(t + 1) is used as an estimate of $1/\pi$. Thus $\bar{y}_{PS}$ is less
biased than the sample mean $\bar{y}_1$ from the first call, but its variance is
greater, because an unweighted mean is replaced by a weighted mean with
estimated weights.
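The weighting can be illustrated with a short function. This is a sketch only: the function name and the two-group data are hypothetical, and each respondent is assumed to have reported t correctly.

```python
def politz_simmons(groups):
    """Politz-Simmons weighted mean.

    groups maps t (nights at home out of the preceding five, 0..5)
    to a pair (n_t, mean_t): interviews obtained and their item mean.
    """
    numerator = sum(6 * n * ybar / (t + 1) for t, (n, ybar) in groups.items())
    denominator = sum(6 * n / (t + 1) for t, (n, _) in groups.items())
    return numerator / denominator


# Hypothetical first-call results: six people never at home on the other
# five nights (t = 0) and six who were at home every night (t = 5).
sample = {0: (6, 10.0), 5: (6, 20.0)}
unweighted = (6 * 10.0 + 6 * 20.0) / 12
print(unweighted, politz_simmons(sample))
```

The t = 0 group receives weight 6 per person and the t = 5 group weight 1, so the weighted mean moves from 15 toward the value for the hard-to-find respondents.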

In presenting the mean and variance of $\bar{y}_{PS}$, we use the notation of
section 13.5. The population is divided into classes, people in the jth
class being at home a fraction $\pi_j$ of the time. Note that the tth group
(i.e., persons at home t nights out of the preceding five) will contain persons
from various classes. Let $n_{jt}$, $\bar{y}_{jt}$ be the number and the item mean for
those in class j and group t. Then $\bar{y}_{PS}$ may be written as follows:

$$\bar{y}_{PS} = \frac{\sum_j \sum_t 6n_{jt}\bar{y}_{jt}/(t + 1)}{\sum_j \sum_t 6n_{jt}/(t + 1)} = \frac{N}{D}$$

This is a ratio type of estimate. In large samples its mean is approximately
E(N)/E(D).

If $n_0$ is the initial size of sample (responses plus not-at-homes) and $n_j$ is
the number from class j who are interviewed, the following assumptions
are made:

(i) $n_j/n_0$ is a binomial estimate of $p_j\pi_j$;

(ii) $E(n_{jt} \mid n_j) = n_j \dfrac{5!}{t!(5 - t)!}\pi_j^t(1 - \pi_j)^{5-t}$;

(iii) $E(\bar{y}_{jt}) = \mu_j$ for any j and t.

Assumption (ii) is open to question. Without giving a detailed
discussion, it assumes that people report correctly whether they were at home.
For given j, using assumption (ii),

$$E\left(\sum_{t=0}^{5}\frac{6n_{jt}}{t + 1} \,\Big|\, n_j\right) = n_j\sum_{t=0}^{5}\frac{6!}{(t + 1)!(5 - t)!}\pi_j^t(1 - \pi_j)^{5-t} = \frac{n_j[1 - (1 - \pi_j)^6]}{\pi_j}$$

Hence

$$E(D) = \sum_j E(n_j)\,\frac{1 - (1 - \pi_j)^6}{\pi_j} = n_0\sum_j p_j[1 - (1 - \pi_j)^6]$$

by assumption (i). Further, since $E(\bar{y}_{jt}) = \mu_j$ for any j and t, this gives

$$E(\bar{y}_{PS}) \doteq \bar{\mu}_{PS} = \frac{\sum_j p_j\mu_j[1 - (1 - \pi_j)^6]}{\sum_j p_j[1 - (1 - \pi_j)^6]}$$

Since the true mean $\bar{\mu} = \sum p_j\mu_j$, some bias remains in $\bar{y}_{PS}$. In a certain
sense, this estimate has the same bias as $\bar{y}_6$, the sample mean given by the
call-back method with a requirement that as many as six calls be made if
necessary. In section 13.5, equation 13.8, it was shown that the call-back
method, with a total of i calls, gives an unbiased estimate of $\bar{\mu}_i =
\sum w_{ij}p_j\mu_j / \sum w_{ij}p_j$, where $w_{ij}$ is the probability that a person in class j who
falls in the sample will be interviewed. Now $w_{1j} = \pi_j$. If at subsequent
calls the probability of finding at home a person not previously reached
remains at $\pi_j$, then

$$w_{6j} = [1 - (1 - \pi_j)^6]$$

so that $\bar{\mu}_{PS} = \bar{\mu}_6$. However, with the call-back method the probability
of an interview at a later call may be greater than $\pi_j$ as a result of
information obtained by the interviewer at the first or earlier calls. In this event
the call-back method has less bias after six calls.

The variance of $\bar{y}_{PS}$ is rather complicated. With the usual
approximation for a ratio estimate, it may be expressed, following Deming (1953),
as

$$V(\bar{y}_{PS}) = \frac{1}{n_0U^2}\Big\{\sum_j \pi_jp_jB_j[\sigma_j^2 + (\mu_j - \bar{\mu}_{PS})^2] + (n_0 - 1)\sum_j (\pi_jp_j)^2(B_j - A_j^2)(\mu_j - \bar{\mu}_{PS})^2\Big\}$$

where

$$U = 1 - \sum_j p_j(1 - \pi_j)^6$$

$$A_j = \frac{1 - (1 - \pi_j)^6}{\pi_j}$$

$$B_j = \sum_{t=0}^{5}\left(\frac{6}{t + 1}\right)^2\frac{5!}{t!(5 - t)!}\pi_j^t(1 - \pi_j)^{5-t}$$

Although this expression is difficult to appraise without applying it to
specific populations, two comments can be made. If the $\pi_j$ do not differ
greatly, that is, if the bias from first calls is moderate, the dominating
term is the first:

$$\frac{1}{n_0U^2}\sum_j \pi_jp_jB_j\sigma_j^2$$

This expression tends to be 25 to 35% higher than the variance of the
unweighted mean of the first calls. Also, $V(\bar{y}_{PS})$ contains a term that
does not decrease as $n_0$ increases and becomes important in very large
samples.

To summarize, comparisons made on simulated populations by Deming
(1953), Durbin (1954), and this author suggest that this method shows to
best advantage, in relation to call-backs, when the biases from early calls
are substantial and the sample is large. The reductions in MSE for the
same outlay are small, however, unless call-backs cost substantially more
than postulated here. The Politz-Simmons technique has the advantage
of saving time. Errors and incompleteness in the values of t, not
considered in the analysis, are a disadvantage. The method may also be
applied, as suggested by Simmons (1954), in conjunction with several
call-backs.

Several other methods for mitigating the “not-at-home” bias have been 
proposed. Bartholomew’s (1961) applies to a survey with two calls. He 
supposes that, for those not at home on the first call, the interviewer, by 
careful inquiry, can make the probability of finding them on the second 
call approximately equal. If this is so, the $n_2$ persons interviewed at the
second call are a random subsample of the $(n_0 - n_1)$ persons missed at
the first call. Hence $[n_1\bar{y}_1 + (n_0 - n_1)\bar{y}_2]/n_0$ is an unbiased estimate of
the mean of the initial target sample. The method worked well on some
British surveys to which Bartholomew applied it. In repeated surveys
Kish and Hess (1959) suggest that nonresponses from recent surveys may
serve as a replacement for nonresponses in a current survey. Wherever
the bias from early calls shows a systematic pattern, as in Table 13.1, 
Hendricks (1949) has outlined extrapolation methods to estimate the 
average results that would be given by nonrespondents. 


13.8 A MATHEMATICAL MODEL FOR ERRORS OF
MEASUREMENT


Conceptually, we can imagine that a large number of independent
repetitions of the measurement on the ith unit are possible. Let $y_{i\alpha}$ be
the value obtained in the $\alpha$th repetition. Then

$$y_{i\alpha} = \mu_i + e_{i\alpha}$$

where $\mu_i$ = correct value,
$e_{i\alpha}$ = error of measurement.

The idea of a "correct value" requires a little discussion. With some
items the concept is simple and concrete. For instance, in an inventory
taken by sampling the correct value may be the number of fan belts lying
on a shelf at 12 noon on a specified day. In some cases the correct value
can be defined operationally. A person's correct diastolic blood pressure
at a specified time might be defined as the value obtained when it is
measured by a certain standard instrument under carefully prescribed
conditions. We may realize, however, that our standard instrument is
itself subject to errors of measurement, and we may expect that in course
of time a more precise instrument will be developed. With other items,
for instance some aspect of an employee's attitude toward his employer
or of a person's feelings of ability to cope with his day-to-day problems,
nobody may claim to have a satisfactory method of measuring the "correct
value." Nevertheless, the concept is useful even in such cases.

Under repeated measurements of the same unit, the errors $e_{i\alpha}$ will
follow a frequency distribution. For the ith unit, let $e_{i\alpha}$ have mean $\beta_i$ and
variance $\sigma_i^2$. The term $\beta_i$ represents a bias in the measurements. The
magnitudes of $\beta_i$ and $\sigma_i^2$ will, of course, depend on the nature of the item
being measured and on the measuring instrument. They may depend also
on numerous other factors. With human populations the prevailing
economic and political climate and the amount and type of advance
publicity received by the survey may influence the responses to the
questionnaire.

The next step is to consider how the errors of measurement change
when we move from one unit to another. Various complications can
arise.

For the bias component $\beta_i$ there may be a constant bias, say, $E(\beta_i) = \beta$,
that affects all units in the population. There will also be a component
$(\beta_i - \beta)$ that follows a frequency distribution over the population. This
component may be correlated with the correct value $\mu_i$; for instance,
the measuring device may consistently underestimate high values of $\mu_i$
and overestimate low values.

There may be a correlation between the values of $e_{i\alpha}$ on different units
in the same sample. The simplest example is the "interviewer bias."
Dramatic differences are sometimes found in the mean values of $y_{i\alpha}$
obtained by different interviewers who are sampling comparable parts
of the same population [see Lienau (1941), Mahalanobis (1946), and Barr
(1957)].

A similar effect has appeared when samples of a growing crop are cut 
by different teams and when chemical or biological analyses are done in 
different laboratories. The human factor is not the only cause for cor- 
relations among units that are measured at about the same time. Many 
measuring processes are affected by the weather; some use raw materials 
whose quality varies from batch to batch. In estimating the current sale 
price of homes built some years ago, Hansen, Hurwitz, and Bershad (1961) 
point out that if some houses in the sample have been sold recently their
prices establish a level that guides the interviewer and the householder
in assigning values to houses that have not been sold for many years. In 
fact, the average price recorded for the sample may depend on the order 
in which the recently sold houses appear in the sample. 

In order to handle these intrasample correlations in their most general
terms, a more complex model than that presented here is required. In
particular, the notation for $e_{i\alpha}$ and $\beta_i$ would have to indicate that their
values may depend on the other units present in the sample. However,
the types of correlation that are believed to be most common in practice
can be represented by the present model or by simple extensions of it.

The components of the error of measurement are summarized in
Table 13.11. It should be noted further that values of $\beta_i$ and $d_{i\alpha}$ on
different units in the same sample may be correlated with one another.


TABLE 13.11
COMPONENTS OF THE ERROR OF MEASUREMENT ON THE ith UNIT

Symbol                                  Nature of Component
$\beta$                                 Constant bias over all units
$\beta_i - \beta$                       Variable component of bias, which follows some frequency
                                        distribution with mean zero as i varies and may be
                                        correlated with the correct value $\mu_i$
$d_{i\alpha} = e_{i\alpha} - \beta_i$   Fluctuating component of error, which follows some
                                        frequency distribution with mean zero and variance
                                        $\sigma_i^2$ as $\alpha$ varies for fixed i


Models that are in general similar to the above have been developed
by Hansen et al. (1951), Sukhatme and Seth (1952) and Hansen, Hurwitz,
and Bershad (1961).


13.9 EFFECTS OF CONSTANT BIAS 


Suppose that the measurements $y_i$ on all units are subject to a constant
bias $\beta$ whose magnitude is unknown. Then the mean $\bar{y}$ of a simple random
sample is also subject to bias $\beta$. In the estimated error variance, which
we attach to the sample mean, the bias cancels out, since this estimate is
derived from a sum of squares of terms $(y_i - \bar{y})^2$. Consequently, the
usual computation of confidence limits for $\bar{Y}$ from the sample data takes
no account of the bias. The same results hold in stratified random
sampling.

The situation is essentially the same with regression and ratio estimates.
Consider the regression estimate

$$\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x})$$

where both the $y_i$ and the $x_i$ may be subject to constant biases $\beta_y$ and
$\beta_x$, respectively. Since the least squares estimate b remains unchanged
and since the bias $\beta_x$ cancels out of the term $(\bar{X} - \bar{x})$, it follows that $\bar{y}_{lr}$ is
subject to a bias $\beta_y$. It is easy to verify that the sample estimate of $V(\bar{y}_{lr})$
contains no contribution due to the biases.

With the ratio estimate

$$\bar{y}_R = \frac{\bar{y}}{\bar{x}}\bar{X}$$

the bias is also $\beta_y$, to a first approximation, since in large samples $E(\bar{X}/\bar{x})$
is approximately 1 even if the $x_i$ are subject to a constant bias. In large
samples the sample estimate of variance

$$\frac{\sum (y_i - \hat{R}x_i)^2}{n(n - 1)}$$

will be almost free from bias as an estimate of

$$E[\bar{y}_R - (\bar{Y} + \beta_y)]^2$$

that is, as an estimate of the variance about the biased mean $\bar{Y} + \beta_y$.

To summarize, a constant bias passes undetected by the sample data.
As we have seen (section 1.7), the 95% confidence probabilities are almost
unaffected if the ratio of $\beta_y$ to the standard error of the estimated mean
is less than 0.1, but as the ratio increases beyond this value the computation
of confidence limits becomes misleading. Estimates of change from one
time period to another, or from one stratum to another, remain unbiased,
provided that the bias is constant throughout.
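That a constant bias leaves the estimated error variance untouched can be seen on any small data set. The following sketch uses hypothetical measurements and a hypothetical bias of 3 units; `mean_and_error_variance` is an illustrative helper, and the fpc is ignored.

```python
from statistics import mean, variance


def mean_and_error_variance(values):
    """Sample mean and the usual estimate s^2/n of its variance (fpc ignored)."""
    return mean(values), variance(values) / len(values)


correct = [12.0, 15.0, 9.0, 14.0, 10.0]   # hypothetical correct values
biased = [y + 3.0 for y in correct]       # every measurement too high by 3

m_correct, v_correct = mean_and_error_variance(correct)
m_biased, v_biased = mean_and_error_variance(biased)
print(m_biased - m_correct, v_correct, v_biased)
```

The mean shifts by the full bias while the estimated variance is identical, so nothing in the sample signals that the confidence limits are centered in the wrong place.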


13.10 EFFECTS OF ERRORS THAT ARE UNCORRELATED 
WITHIN THE SAMPLE 


If constant bias is ignored and errors of measurement are uncorrelated 
within the sample, the ordinary formulas for estimating the standard 
errors of sample estimates remain valid, provided that fpc terms are 
negligible. This result is proved for simple random sampling. 

The model can be written 


$$y_{i\alpha} = \mu_i + \beta_i + d_{i\alpha} = \mu_i' + d_{i\alpha} \qquad (13.19)$$

where $\mu_i' = \mu_i + \beta_i$ is the average value given by the measuring process
on the ith unit. Since constant bias is ignored,

$$E(\beta_i) = 0, \qquad E(\mu_i') = \mu = \text{correct population mean}$$

We suppose, as holds in most surveys, that only one measurement is
made on each unit. The sample means of $y_{i\alpha}$, $\mu_i'$, and $d_{i\alpha}$ are denoted by
$\bar{y}_\alpha$, $\bar{\mu}'$, and $\bar{d}_\alpha$. From (13.19)

$$\bar{y}_\alpha - \mu = (\bar{\mu}' - \mu) + \bar{d}_\alpha$$

Since $E(d_{i\alpha} \mid i) = 0$, by the definition of $d_{i\alpha}$, it follows that $\bar{y}_\alpha$ is an
unbiased estimate of $\mu$ under simple random sampling. Further,

$$(\bar{y}_\alpha - \mu)^2 = (\bar{\mu}' - \mu)^2 + \bar{d}_\alpha^2 + 2\bar{d}_\alpha(\bar{\mu}' - \mu) \qquad (13.20)$$

Hence

$$V(\bar{y}_\alpha) = E(\bar{\mu}' - \mu)^2 + E(\bar{d}_\alpha^2) \qquad (13.21)$$

the cross-product term vanishing because the $d_{i\alpha}$ have means zero for
any i.

Average first over repeated measurements on the same set of n units.
Since the $d_{i\alpha}$ on different units are independent,

$$E(\bar{d}_\alpha^2) = \frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^2$$

When we average subsequently over all simple random samples, we have

$$V(\bar{y}_\alpha) = \frac{1 - f}{n}\,\frac{\sum_{i=1}^{N}(\mu_i' - \mu)^2}{N - 1} + \frac{1}{nN}\sum_{i=1}^{N}\sigma_i^2 = \frac{1 - f}{n}S_{\mu'}^2 + \frac{1}{n}\sigma_d^2 \qquad (13.22)$$

where $\sigma_d^2$ denotes the average of the variances of the errors of
measurement.

Note that $S_{\mu'}^2$ is the population variance of $(\mu_i + \beta_i)$. It will probably
be larger than $S_\mu^2$, the population variance of the correct values $\mu_i$, though
it can be smaller if $\mu_i$ and $\beta_i$ are negatively correlated. If n = N, that is,
the sample is a complete census, $V(\bar{y}_\alpha)$ does not vanish, because the errors
of measurement give a contribution $\sigma_d^2/N$.
Equation 13.22 produces an interesting result, due to Hansen, Hurwitz,
and Bershad (1961), when a population proportion P is being estimated.
For any unit, let the correct value $\mu_i$ be 1 if the unit is in class C and zero
otherwise, with

$$E(\mu_i) = P, \qquad S_\mu^2 = \frac{\sum_{i=1}^{N}(\mu_i - P)^2}{N - 1} = \frac{NPQ}{N - 1} \qquad (13.23)$$

If errors of measurement occur, this means that some units are
incorrectly classified. For any such unit, the recorded value $y_{i\alpha}$ in repeated
measurements is sometimes 1 and sometimes 0. Let $P_i$ denote the proportion
of measurements on the ith unit for which $y_{i\alpha}$ = 1. Then $y_{i\alpha}$ is a
binomial variate with mean $\mu_i' = P_i$ and the variance of $d_{i\alpha}$ is $P_iQ_i$.
Hence, if $p_\alpha = \bar{y}_\alpha$ is the sample estimate, (13.22) becomes, when n/N is
negligible,

$$V(p_\alpha) = \frac{1 - f}{n}S_{\mu'}^2 + \frac{1}{n}\sigma_d^2 \qquad (13.22)$$

$$\doteq \frac{1}{nN}\Big[\sum_{i=1}^{N}(P_i - P)^2 + \sum_{i=1}^{N}P_iQ_i\Big]$$

$$= \frac{1}{nN}\Big[\sum P_i^2 - NP^2 + \sum P_i - \sum P_i^2\Big]$$

$$= \frac{PQ}{n} \doteq \frac{S_\mu^2}{n} \qquad (13.24)$$

using (13.23).

The quantity $S_\mu^2/n$ is the variance that would be obtained for $V(p_\alpha)$ if
no errors in classification occurred and if n/N were negligible. When
errors of classification are present, the result in (13.24) that $V(p_\alpha) \doteq S_\mu^2/n$
is reassuring. The result may appear paradoxical, however, since we
might expect these errors to have more effect on $V(p_\alpha)$. The explanation
is that in the estimation of a proportion $\mu_i$ and $\beta_i$ are always negatively
correlated. When $\mu_i$ = 1, $\beta_i \le 0$, since $P_i = (\mu_i + \beta_i) \le 1$, and when
$\mu_i$ = 0, $\beta_i \ge 0$, since $P_i \ge 0$.
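The cancellation behind (13.24) can be checked numerically: whatever the unit-level probabilities $P_i$, the between-unit variance of the $P_i$ plus the average within-unit variance $P_iQ_i$ equals PQ, where P is the mean of the $P_i$. The sketch below uses hypothetical $P_i$ values and divisors of N (large-population form); the function name is illustrative.

```python
def variance_components(unit_probs):
    """Return (between, within, pq) for unit-level probabilities P_i.

    between = mean of (P_i - P)^2, within = mean of P_i * Q_i,
    pq = P * (1 - P) with P the mean of the P_i.
    """
    count = len(unit_probs)
    p = sum(unit_probs) / count
    between = sum((x - p) ** 2 for x in unit_probs) / count
    within = sum(x * (1 - x) for x in unit_probs) / count
    return between, within, p * (1 - p)


# Hypothetical population: three units truly in class C, three not,
# some of them misclassified part of the time.
probs = [1.0, 0.9, 0.8, 0.1, 0.0, 0.2]
between, within, pq = variance_components(probs)
print(between, within, pq)
```

Misclassification moves variance from the between-unit component to the within-unit component without changing the total, which is why $V(p_\alpha)$ is essentially unaffected.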

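To revert to the case of a continuous variable, the ordinary formula
for the sample estimate of $V(\bar{y}_\alpha)$ is

$$v(\bar{y}_\alpha) = \frac{1 - f}{n}s^2 = \frac{1 - f}{n}\,\frac{\sum (y_{i\alpha} - \bar{y}_\alpha)^2}{n - 1}$$

From (13.19)

$$y_{i\alpha} - \bar{y}_\alpha = (\mu_i' - \bar{\mu}') + (d_{i\alpha} - \bar{d}_\alpha)$$

By squaring and averaging first over repeated measurements and then
over different selections of the sample, we obtain

$$E(s^2) = E\left[\frac{\sum (y_{i\alpha} - \bar{y}_\alpha)^2}{n - 1}\right] = S_{\mu'}^2 + \sigma_d^2$$

Hence

$$Ev(\bar{y}_\alpha) = \frac{1 - f}{n}(S_{\mu'}^2 + \sigma_d^2) \qquad (13.25)$$

By comparison with (13.22), we see that $v(\bar{y}_\alpha)$ has a negative bias of amount
$\sigma_d^2/N$. On the other hand, if the fpc term (1 - f) is omitted from $v(\bar{y}_\alpha)$, we
obtain an overestimate of amount $S_{\mu'}^2/N$.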
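Both bias amounts follow from a one-line subtraction, using f = n/N. The step is spelled out here for convenience; it is a worked check of the statement above rather than additional theory.

```latex
E\,v(\bar{y}_\alpha) - V(\bar{y}_\alpha)
  = \frac{1-f}{n}\left(S_{\mu'}^2 + \sigma_d^2\right)
    - \left[\frac{1-f}{n}S_{\mu'}^2 + \frac{1}{n}\sigma_d^2\right]
  = -\frac{f}{n}\sigma_d^2 = -\frac{\sigma_d^2}{N},
\qquad
\frac{1}{n}\left(S_{\mu'}^2 + \sigma_d^2\right) - V(\bar{y}_\alpha)
  = \frac{f}{n}S_{\mu'}^2 = \frac{S_{\mu'}^2}{N}.
```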
In the same way it can be shown that the formulas given in preceding
chapters for the sample estimates of error variances remain valid for
stratified sampling and for multistage sampling and that the approximate
formulas for ratio and regression estimates remain valid in large samples,
provided that errors of measurement in $y_{i\alpha}$ and $x_{i\alpha}$ are uncorrelated in
the sample and that the fpc's can be ignored. (The error in $y_{i\alpha}$ may be
correlated with that in the corresponding $x_{i\alpha}$.)

These results raise the question: in what kinds of surveys are errors of
measurement uncorrelated within the sample? Since intrasample cor- 
relations can enter in the process of measurement, in copying the measure- 
ments, in editing and coding, particularly if subjective decisions are 
involved, and in transferring the data to tabulating equipment, the 
assumption that there is no correlation cannot be made glibly. The risk 
of correlation should, however, be minimized in surveys taken from 
records, in self-filled questionnaires (as in mail surveys) in which individuals 
in the same sample do not consult one another, and in surveys of inanimate 
populations in which the measurement is objective. 


13.11 EFFECTS OF INTRASAMPLE CORRELATION
BETWEEN ERRORS

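Under the model

$$y_{i\alpha} = \mu_i + \beta_i + d_{i\alpha} = \mu_i' + d_{i\alpha},$$

some of the most common types of intrasample correlation can be
represented by supposing that values of $d_{i\alpha}$ for units in the same sample are
correlated. In finding $V(\bar{y}_\alpha)$, the analysis given in section 13.8 proceeds
without change down to equation (13.21),

$$V(\bar{y}_\alpha) = E(\bar{\mu}'_n - \bar{\mu}')^2 + E(\bar{d}_\alpha^2) \qquad (13.21)$$

Now

$$\bar{d}_\alpha^2 = \frac{1}{n^2}\left(\sum_i d_{i\alpha}^2 + 2\sum_{i<j} d_{i\alpha} d_{j\alpha}\right)$$

Hence

$$E(\bar{d}_\alpha^2) = \frac{1}{n}\,\sigma_d^2 + \frac{2n(n-1)}{2n^2}\, E(d_{i\alpha} d_{j\alpha}) \qquad (13.26)$$

where the products are taken over all pairs of units in the same sample.
By analogy with cluster sampling, the average intrasample correlation
coefficient $\rho_w$ may be defined by the equation

$$E(d_{i\alpha} d_{j\alpha}) = \rho_w \sigma_d^2$$

This gives, from (13.22) and (13.26),

$$V(\bar{y}_\alpha) = \frac{1}{n}\{(1-f)S_{\mu'}^2 + \sigma_d^2[1 + (n-1)\rho_w]\} \qquad (13.27)$$
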
The average value of $v(\bar{y}_\alpha)$ is found in the same way to be

$$E\,v(\bar{y}_\alpha) = \frac{1-f}{n}\,[S_{\mu'}^2 + \sigma_d^2(1 - \rho_w)] \qquad (13.28)$$

Since $\rho_w$ is likely to be positive for most types of measurement error,
the standard formula $v(\bar{y}_\alpha)$ is usually an underestimate. Whether the
underestimation is serious depends on the relative sizes of $S_{\mu'}^2$, $\sigma_d^2$, and
$n\rho_w$.
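The size of the underestimation can be illustrated numerically. The sketch below is not from the text: it assumes (13.27) and (13.28) in the forms $[(1-f)S_{\mu'}^2 + \sigma_d^2\{1 + (n-1)\rho_w\}]/n$ and $(1-f)[S_{\mu'}^2 + \sigma_d^2(1-\rho_w)]/n$, and the function names and input values are hypothetical.

```python
def variance_of_mean(S2, sigma2, rho_w, n, f=0.0):
    """Variance of the sample mean with correlated errors, assumed form of (13.27)."""
    return ((1.0 - f) * S2 + sigma2 * (1.0 + (n - 1) * rho_w)) / n


def expected_variance_estimate(S2, sigma2, rho_w, n, f=0.0):
    """Average value of the usual estimator v(y), assumed form of (13.28)."""
    return (1.0 - f) * (S2 + sigma2 * (1.0 - rho_w)) / n


# Hypothetical inputs, chosen only for illustration.
S2, sigma2, rho_w = 1.0, 0.25, 0.05

for n in (10, 100, 1000):
    true_v = variance_of_mean(S2, sigma2, rho_w, n)
    avg_v = expected_variance_estimate(S2, sigma2, rho_w, n)
    # With f = 0 the shortfall true_v - avg_v equals rho_w * sigma2 for every n,
    # so the ratio avg_v / true_v falls as n grows.
    print(n, round(true_v, 6), round(avg_v, 6), round(avg_v / true_v, 3))
```

Because the shortfall does not shrink with n while both variances do, the relative underestimation grows with the sample size, which is why the product of n and the correlation matters.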

This model represents only the simplest type of intrasample correlation. 
With stratified sampling, for instance, a coder may process results from 
several strata and, through a misunderstanding of instructions, may
introduce correlated errors that extend over the strata. The mathematical 
model can be adapted to apply to situations of this type. 


13.12 SUMMARY OF THE EFFECTS OF ERRORS 
OF MEASUREMENT 


In terms of the model, the mean $\bar{y}_\alpha$ of a simple random sample would be
unbiased, with variance $S_\mu^2/n$ (ignoring the fpc), if all measurements were
fully accurate. As a result of the types of errors of measurement discussed
here, the mean may be subject to a bias of amount $\bar{\beta}$, and its mean square
error is

$$\text{MSE}(\bar{y}_\alpha) = \frac{1}{n}\{S_{\mu'}^2 + \sigma_d^2[1 + (n-1)\rho_w]\} + \bar{\beta}^2 \qquad (13.29)$$

where $\mu_i' = \mu_i + \beta_i$.



Formula 13.29 contains two terms, $S_{\mu'}^2/n$ and $\sigma_d^2(1 - \rho_w)/n$, that
decrease as 1/n. The remaining two terms, $\rho_w\sigma_d^2$ and $\bar{\beta}^2$, appear at first
sight to be independent of n. This is probably an oversimplification.
Any material change in the size of sample may require a change in the
field methods of measurement, and this may affect $\rho_w$ and $\bar{\beta}$. However,
these two terms should change relatively slowly, if at all, with n. Thus in
large samples the MSE is likely to be dominated by these two terms, the
ordinary sampling variance becoming unimportant and misleading as a
guide to the real accuracy of the results.
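How the two terms that do not shrink with n come to dominate can be seen numerically. The sketch below splits the MSE of (13.29), with the fpc ignored, into its four terms. The function name and all input values are hypothetical and serve only as an illustration.

```python
def mse_components(S2, sigma2, rho_w, bias, n):
    """Split the MSE of the sample mean into the four terms of (13.29), fpc ignored."""
    return {
        "sampling": S2 / n,                                # decreases as 1/n
        "uncorrelated_error": sigma2 * (1.0 - rho_w) / n,  # decreases as 1/n
        "correlated_error": rho_w * sigma2,                # does not depend on n
        "squared_bias": bias ** 2,                         # does not depend on n
    }


# Hypothetical inputs: a small correlation and a small bias.
S2, sigma2, rho_w, bias = 1.0, 0.25, 0.02, 0.05

for n in (100, 1000, 10000):
    parts = mse_components(S2, sigma2, rho_w, bias, n)
    mse = sum(parts.values())
    fixed_share = (parts["correlated_error"] + parts["squared_bias"]) / mse
    print(n, round(mse, 6), round(fixed_share, 3))
```

Even with a correlation of only 0.02, the share of the MSE contributed by the two fixed terms rises steadily with n in this illustration.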


13.13 THE STUDY OF ERRORS OF MEASUREMENT 


In recent years much of the research on sampling practice has been 
devoted to the study of errors of measurement. The objectives are to 
discover the components that make a large contribution to the MSE and 
to find ways of decreasing these contributions. Some of the principal 
methods are described in this and the following sections. It is already
clear that progress will be slow and expensive. One reason is that, as
already mentioned, the measurement errors depend intimately both on
the items and on the measuring process. Results about measurement
errors found in one survey can seldom be assumed to apply to other
surveys.

Ideally, the best method of studying errors of measurement is to obtain
the correct values $\mu_i$. In practice, this approach is limited to items for
which a feasible method of finding $\mu_i$ exists and by problems of expense
and execution. Examples are given by Belloc (1954), who compared data
on hospitalization as reported in household interviews with the hospital
records for the individual, and by Gray (1955), who compared employees'
statements of sick leave with the personnel office records. Checks of this
type—sometimes called "record-checks"—are possible with items such 
as age, occupation, number of years of schooling, and price paid for car. 
One difficulty is that sometimes the records contain no exact match of the 
person interviewed. 

Failing a method of determining the correct value, an alternative is to 
remeasure by an independent method that is considered more accurate. 
Kish and Lansing (19 
selling prices of homes that had already been reported by the home 


owners. In surveys of illness respondents’ replies have been compared 
either with doctors’ 


TS, but the comparisons will at least 
tine instrument agrees well with the 
hich it does not. 

surveys is to reinterview a subsample 
of the method the reinterviewing team 
strument, the best interviewers being 
ng more detailed and probing. In this 
Scovering the items for which the first 
underestimate $\sigma_d^2$. If the time interval between the two visits is lengthened
in order to minimize this effect, the respondent on the second occasion
may not remember clearly his situation at the earlier time, so that the
discrepancies overestimate $\sigma_d^2$.

Occasionally, over-all comparisons between the results of two different 
surveys are feasible. For a number of items, the results of the U.S. Census 
can be compared with those given by the Current Population Survey taken 
at the same time. Since the Survey is considered more accurate, particu- 
larly for items difficult to measure, rough estimates of the measurement 
bias $\bar{\beta}$ in the Census data can be made [Hansen, Hurwitz, and Bershad
(1961)]. A number of comparisons between the results of quota samples
and probability samples are discussed by Stephan and McCarthy (1958). 


13.14 INTERPENETRATING SUBSAMPLES


This technique, particularly useful for the study of correlated errors, 
was proposed by Mahalanobis (1946). To present it in the simplest terms, 
a random sample of n units is divided at random into k subsamples, each 
subsample containing m = n/k units. The field work and processing of 
the sample are planned so that there is no correlation between the errors 
of measurement of any two units in different subsamples. For instance, 
suppose that the correlation with which we have to deal arises solely from 
biases of the interviewers. If each of k interviewers is assigned to a 
different subsample and if there is no correlation between errors of 
measurement for different interviewers, we have an example of the
technique.

With the same mathematical model, it is convenient to label the units
by double subscripts. Let

$$y_{ij\alpha} = \mu_{ij}' + d_{ij\alpha}$$

where i denotes the subsample (interviewer) and j the member within the
subsample. The fpc is ignored.

Since the ith subsample is a random subsample, it is itself a simple
random sample of size m. Hence, by (13.27), the variance of its mean is

$$V(\bar{y}_{i\alpha}) = \frac{1}{m}\{S_{\mu'}^2 + \sigma_d^2[1 + (m-1)\rho_w]\}$$

where $\rho_w$ is the correlation between the $d_{ij\alpha}$ obtained by the same
interviewer. Since errors are independent in the different subsamples,

$$V(\bar{y}_\alpha) = \frac{1}{k}\,V(\bar{y}_{i\alpha}) = \frac{1}{n}\{S_{\mu'}^2 + \sigma_d^2[1 + (m-1)\rho_w]\} \qquad (13.30)$$

From the sample results, we can compute an analysis of variance into
the components “between interviewers (subsamples)” and “within
interviewers.” It is easy to verify that the expected values of the mean squares
work out as in Table 13.12.


TABLE 13.12
EXPECTATIONS OF THE MEAN SQUARES (ON A SINGLE-UNIT BASIS)

                                       Mean square    Expectation
Between interviewers (subsamples)      $s_b^2$        $S_{\mu'}^2 + \sigma_d^2[1 + (m-1)\rho_w]$
Within interviewers                    $s_w^2$        $S_{\mu'}^2 + \sigma_d^2(1 - \rho_w)$


This analysis contains two important results. By comparison with
(13.30) we see that $s_b^2/n$ is an unbiased estimate of $V(\bar{y}_\alpha)$. Thus the method
provides an estimate of error that takes account of interviewer biases.
The analysis also supplies information about $\rho_w$. The F-ratio $s_b^2/s_w^2$
gives a test of significance of the null hypothesis $\rho_w = 0$. The quantity
$(s_b^2 - s_w^2)/m$ is an unbiased estimate of $\rho_w\sigma_d^2$. This result enables us to
answer the question: how much do the interviewer biases contribute to
$V(\bar{y}_\alpha)$? The situation in which there are no interviewer biases can be
represented by putting $\rho_w = 0$ and assuming that $\sigma_d^2$ remains unchanged.
Under this supposition $V(\bar{y}_\alpha)$ becomes

$$V'(\bar{y}_\alpha) = \frac{1}{n}\,(S_{\mu'}^2 + \sigma_d^2)$$

An unbiased estimate is

$$v'(\bar{y}_\alpha) = \frac{1}{n}\left[s_w^2 + \frac{s_b^2 - s_w^2}{m}\right] = \frac{1}{nm}\,[(m-1)s_w^2 + s_b^2]$$

This is to be compared with $s_b^2/n$, an estimate of the actual $V(\bar{y}_\alpha)$.

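The quantities discussed above can be computed directly from the subsample data. The following is a minimal sketch, assuming k subsamples of equal size m, mean squares on a single-unit basis, and the fpc ignored. The function name, the dictionary keys, and the example data are hypothetical.

```python
from statistics import mean


def interpenetrating_anova(subsamples):
    """One-way analysis of variance for k equal-sized subsamples (one per interviewer)."""
    k = len(subsamples)
    m = len(subsamples[0])
    if k < 2 or m < 2 or any(len(s) != m for s in subsamples):
        raise ValueError("need k >= 2 subsamples, each with the same size m >= 2")
    n = k * m
    sub_means = [mean(s) for s in subsamples]
    grand_mean = mean(sub_means)
    # Between-interviewer mean square on a single-unit basis, k - 1 df.
    s_b2 = m * sum((yb - grand_mean) ** 2 for yb in sub_means) / (k - 1)
    # Within-interviewer mean square, k(m - 1) df.
    s_w2 = sum((y - yb) ** 2
               for s, yb in zip(subsamples, sub_means) for y in s) / (k * (m - 1))
    return {
        "s_b2": s_b2,
        "s_w2": s_w2,
        "F": s_b2 / s_w2 if s_w2 > 0 else float("inf"),
        "v_actual": s_b2 / n,                            # estimate of the actual variance
        "rho_sigma2": (s_b2 - s_w2) / m,                 # estimate of rho_w * sigma_d^2
        "v_no_bias": ((m - 1) * s_w2 + s_b2) / (n * m),  # estimate with rho_w = 0
    }


# Hypothetical data: three interviewers, three units each.
result = interpenetrating_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(result)
```

Comparing `v_actual` with `v_no_bias` gives the contribution of interviewer biases to the variance of the mean, under the assumptions stated.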
An alternative model that produces the same results represents the bias
of the ith interviewer by a term $g_i$, which has mean zero and variance
$\sigma_g^2$ as i varies. Thus

$$y_{ij\alpha} = \mu_{ij}' + g_i + d_{ij\alpha} \qquad (13.31)$$

where the $d_{ij\alpha}$ are now uncorrelated with each other and with the $g_i$. For
this model,

$$V(\bar{y}_\alpha) = \frac{1}{n}\,(S_{\mu'}^2 + m\sigma_g^2 + \sigma_d^2) \qquad (13.32)$$

and $s_b^2$ in the analysis of variance is found to be an unbiased estimate of
$nV(\bar{y}_\alpha)$ as before. The situation in which interviewer biases are removed
is represented by deleting $g_i$ and replacing $\sigma_d^2$ by $(\sigma_d^2 + \sigma_g^2)$ to leave the
error variance of the interviewer's estimate of a single unit unchanged.

As the name interpenetrating subsamples implies, it is essential that 
the subsamples be chosen at random. A common practice is to assign 
each interviewer to the units lying in a small geographic area near his 
home, in order to decrease travel costs. Any real differences between the 
averages $\bar{\mu}_i$ for different areas then appear in the analysis of variance as if
they were interviewer biases. Thus $s_b^2/n$ becomes an overestimate of $V(\bar{y}_\alpha)$.

The technique extends to stratified and multistage sampling. If the
sole interest is in an unbiased estimate of $V(\bar{y}_\alpha)$, all that is necessary is
that the sample consist of a number of subsamples of the same structure
in which we are sure that errors of measurement are independent in
different subsamples. Strictly, this requires that different interviewing
teams, supervisors, and data processors be used in different subsamples.
If $\bar{y}_{i\alpha}$ is the mean of the ith subsample, the quantity $\sum(\bar{y}_{i\alpha} - \bar{y}_\alpha)^2/k(k-1)$
is an unbiased estimate of $V(\bar{y}_\alpha)$, with (k − 1) df. This result holds because
the subsample can be regarded as a single complex sampling unit, the
sample being in effect a simple random sample of these complex units,
with uncorrelated errors of measurement between different complex units.
Consequently, the results in section 13.10 apply. 
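The estimator just described needs only the subsample means. A minimal sketch follows, assuming k subsamples of the same structure and size, so that the overall mean is the unweighted mean of the subsample means. The function name and the example values are hypothetical.

```python
def replicated_variance(subsample_means):
    """Return (variance estimate of the overall mean, degrees of freedom)."""
    k = len(subsample_means)
    if k < 2:
        raise ValueError("at least two subsamples are needed")
    overall = sum(subsample_means) / k
    sum_sq = sum((yb - overall) ** 2 for yb in subsample_means)
    # Sum of squared deviations of subsample means, divided by k(k - 1).
    return sum_sq / (k * (k - 1)), k - 1


# Hypothetical subsample means from three independently processed subsamples.
estimate, df = replicated_variance([2.0, 3.0, 7.0])
print(estimate, df)
```

No knowledge of the internal design of a subsample is required, which is what makes the method convenient for complex plans.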

Numerous applications of this method, sometimes called replicated
sampling, are described by Deming (1960), who has used the method
extensively. For other discussions of its advantages, see Jones (1955) and
Koop (1960). Travel costs of interviewers are usually increased by
interpenetration, but this can be mitigated if the sample is stratified into
compact areas. For instance, each stratum might consist of two random
subsamples, each assigned to a different interviewer. Each interviewer is
required to travel over the whole stratum instead of over only half the
stratum. Every stratum provides 1 df for the estimate of $V(\bar{y}_\alpha)$.


13.15 EXTENSION TO MORE COMPLEX PLANS 


In general, the interpretation of the analysis of variance depends on 
the nature of the plan and must usually be worked out separately. As an 
illustration, consider a two-stage stratified sample with L strata. In each 
stratum $n'k$ primary units are selected, each of the k interviewers being
assigned to a random subsample of n' primary units. The primary units 
are assumed to be of equal size. Since the population is compact, each 
interviewer works in every stratum. The Bengal Labour Enquiry 1941-1942
described by Mahalanobis (1946) and the health survey in the
Arsenal district of Pittsburgh described by Horvitz (1952) were roughly
of this type.

In terms of the alternative model in (13.31), the mean for the jth primary
unit of the ith interviewer in stratum h may be written

$$\bar{y}_{hij\alpha} = \bar{\mu}_h + g_i + w_{hi} + (\bar{\mu}'_{hij} - \bar{\mu}_h) + \bar{d}_{hij\alpha}$$

where $\bar{\mu}_h$ is the correct mean for the stratum. As before, $\bar{\mu}'_{hij} = \bar{\mu}_{hij} + \bar{\beta}_{hij}$.
It is assumed that $(\bar{\mu}'_{hij} - \bar{\mu}_h)$ averages to zero over the stratum and has
variance $\sigma_\mu^2$. The $\bar{d}_{hij\alpha}$ are uncorrelated, with mean zero for any h, i, j,
and variance $\sigma_d^2$.

The equation contains a new term $w_{hi}$, denoting an interviewer × stratum
interaction. There is not much evidence that this term is needed to make
the model realistic, although, if there are marked differences between
strata in economic level, an interviewer might show a differential bias in
different strata. The $w_{hi}$ are assumed distributed with mean zero and
variance $\sigma_{IS}^2$.

The expectations of the mean squares of the useful terms in the analysis
of variance work out as follows. The unit of analysis is the mean per
subunit for a single primary unit.

                                    df                ms           E(ms)
Between interviewers                k − 1             $s_I^2$      $\sigma_\mu^2 + \sigma_d^2 + n'\sigma_{IS}^2 + n'L\sigma_g^2$
Interviewers × strata               (k − 1)(L − 1)    $s_{IS}^2$   $\sigma_\mu^2 + \sigma_d^2 + n'\sigma_{IS}^2$
Between pu within interviewers      kL(n' − 1)                     $\sigma_\mu^2 + \sigma_d^2$

This analysis enables us to estimate both the components $\sigma_{IS}^2$ and $\sigma_g^2$.
If n/N is negligible, the “between interviewers” mean square gives an
unbiased estimate of $kn'L\,V(\bar{y}_\alpha)$.


Example. In Horvitz' data L = 6, k = 18, n' = 1. The mean squares (in
our units) were $s_I^2 = 0.0759$, $s_{IS}^2 = 0.0147$, for number of persons ill per
household during the preceding month. Estimate the contribution of interviewer
biases to $V(\bar{y}_\alpha)$.

In terms of the model, the analysis of variance gives

$$s_I^2 = 0.0759 \approx \sigma_\mu^2 + \sigma_d^2 + \sigma_{IS}^2 + 6\sigma_g^2$$
$$s_{IS}^2 = 0.0147 \approx \sigma_\mu^2 + \sigma_d^2 + \sigma_{IS}^2$$

The third term (between pu within interviewers) cannot be computed, since
n' = 1.

The estimate of $\sigma_g^2$ is (0.0759 − 0.0147)/6 = 0.0102. With n' = 1, it is not
possible to estimate $\sigma_{IS}^2$. To simulate the situation in which there are no
interviewer biases, we shall assume that the interviewer's contribution to the error
of measurement of a single subunit (household) has a variance $\sigma_g^2 + \sigma_{IS}^2$ but
that these contributions are independent from household to household. Since
the assignment of an interviewer in a single stratum contained about 26
households, the total variance of a mean per primary unit would be

$$\sigma_\mu^2 + \sigma_d^2 + \tfrac{1}{26}(\sigma_{IS}^2 + \sigma_g^2)$$

Unless $\sigma_{IS}^2 = 0$, we cannot estimate this quantity. The quantity

$$0.0147 + \tfrac{1}{26}(0.0102) = 0.0151 \approx \sigma_\mu^2 + \sigma_d^2 + \sigma_{IS}^2 + \tfrac{1}{26}\sigma_g^2$$

is an overestimate which should not be seriously wrong if $\sigma_{IS}^2$ is small. This
is to be compared with 0.0759, the actual variance on a primary unit mean
basis. The contribution of interviewer biases to the total variance is about
80%. The estimate given by Horvitz (1952), who used a different value for
$\sigma_{IS}^2$, is 72%.
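The arithmetic of this example can be retraced in a few lines. The sketch below uses only the figures quoted in the example (the two mean squares, L = 6, n' = 1, and about 26 households per assignment). The variable names are hypothetical.

```python
# Figures quoted in the example.
ms_between_interviewers = 0.0759
ms_interviewers_by_strata = 0.0147
L = 6            # strata
n_prime = 1      # primary units per interviewer per stratum
households = 26  # approximate households per interviewer assignment in a stratum

# Interviewer variance component: difference of mean squares divided by n'L.
var_interviewer = (ms_between_interviewers - ms_interviewers_by_strata) / (n_prime * L)

# Variance per primary-unit mean if interviewer effects were independent from
# household to household (an overestimate unless the interaction component is zero).
var_without_bias = ms_interviewers_by_strata + var_interviewer / households

# Share of the actual variance attributed to interviewer biases.
share = (ms_between_interviewers - var_without_bias) / ms_between_interviewers

print(round(var_interviewer, 4), round(var_without_bias, 4), round(100 * share))
```

The three printed values correspond to the 0.0102, 0.0151, and roughly 80% quoted in the example.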


13.16 CONTROLLED EXPERIMENTS IMBEDDED 
IN SURVEYS 


An extension of the idea of interpenetrating subsamples is to build into 
a survey an experimental comparison of some aspect of the measuring 
process or field operations. Perhaps the oldest example is the split 
schedule. Two forms q, q' of the questionnaire are prepared, differing in 
the wording of certain questions or in the order of the questions. In one 
plan each questionnaire is given to a random half of the units in the 
sample or in a part of it. For any item, the mean difference $\bar{y}_q - \bar{y}_{q'}$ and
its standard error can be computed and a test of the relative bias
performed. If appropriate, the variance ratio $s_q^2/s_{q'}^2$ can be tested to see
whether one form gives more erratic responses than the other. An
alternative plan is to arrange the units in pairs so that both members of a pair
are expected to show similar responses. Each questionnaire is given to
one member of each pair. If pairing is effective, this plan provides a more
precise estimate of $\bar{y}_q - \bar{y}_{q'}$, although a less precise estimate of the variance
ratio.
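For the first plan (each form given to a random half), the comparison can be sketched as follows. This is an illustration under stated assumptions, not a procedure from the text: the two halves are treated as independent simple random samples, the fpc is ignored, and the standard error uses separate (unpooled) variances. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def split_schedule_comparison(y_q, y_q_prime):
    """Compare two questionnaire forms given to independent random halves.

    Returns (mean difference, its standard error, variance ratio).
    """
    n1, n2 = len(y_q), len(y_q_prime)
    diff = mean(y_q) - mean(y_q_prime)
    v1, v2 = variance(y_q), variance(y_q_prime)  # sample variances, divisor n - 1
    se = sqrt(v1 / n1 + v2 / n2)
    return diff, se, v1 / v2


# Hypothetical responses under forms q and q'.
diff, se, ratio = split_schedule_comparison([1, 2, 3, 4], [2, 4, 6, 8])
print(diff, round(se, 4), ratio)
```

The ratio of `diff` to `se` gives a large-sample test of relative bias, and the variance ratio can be referred to the F distribution if the responses are roughly normal. The paired plan would instead use the variance of the within-pair differences.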

The idea is to make part of the survey a controlled experiment using
the precautions, including randomization, that are typical of good
experimentation. In more complex surveys considerable care must be taken
in planning to ensure that unbiased estimates of the effects that are of
interest are obtainable. Two illustrations are given.

In the 1950 U.S. Census an experiment designed to estimate the effect 
of interviewer biases was conducted in 24 counties in two states. More 
than 700 interviewers participated. The counties were divided into 125 
areas of average population 6500. An area contained on the average
enough work for six interviewers, although the sizes of the areas varied.
An area with enough work for k interviewers was divided into 2k subareas, 
two of which were allotted at random to each interviewer. Within each 
area the interviewer variance can be estimated by an analysis of variance 
of the kind in Table 13.12. Descriptions and results are given by Hanson
and Marks (1958) and Hansen, Hurwitz, and Bershad (1961).
In the example described by Durbin and Stuart (1954) the experiment 
was the principal purpose of the survey rather than incidental to it. The 
aims were to estimate (a) the effects of two types of clustering of interviews
on response rates, costs of interviewing, and precision, and (b) the effects
of interviewer biases in urban surveys in parts of six towns. Three survey
agencies supplied two interviewers each for every town, making 3 × 2 × 6
or 36 interviewers in the main part of the experiment. The sampling frame
was the Electoral Register, one registration area being sampled in each town.
For the unclustered sample (type 1), a systematic sample of 30 names 
was drawn from the Register for each interviewer. In type 2 a systematic 


TABLE 13.13
DESIGN OF A SURVEY ON CLUSTERING AND INTERVIEWER BIASES

                      Town
Agency     I     II    III   IV    V     VI
A          1     2     1     3     3     2
B          2     3     2     1     1     3
C          3     1     3     2     2     1

sample of 15 names was drawn from each of two polling districts within 
the area. In type 3 the sample from each polling district was taken from 
a single street or group of small streets. Thus type 3 has the most cluster- 
ing, type 2 being intermediate. The over-all design, in a latin square 
pattern, is given in Table 13.13. 
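The balance of the design in Table 13.13 can be verified mechanically. The sketch below transcribes the table as read here (rows are agencies A, B, C; columns are towns I to VI; entries are the sample types 1, 2, 3) and checks two properties: every town uses each type once, and every agency uses each type twice. The helper name is hypothetical.

```python
from collections import Counter

# Table 13.13 as transcribed: sample type used by each agency in towns I..VI.
towns = ["I", "II", "III", "IV", "V", "VI"]
design = {
    "A": [1, 2, 1, 3, 3, 2],
    "B": [2, 3, 2, 1, 1, 3],
    "C": [3, 1, 3, 2, 2, 1],
}


def is_balanced(design, types=(1, 2, 3)):
    """True if each town has every type once and each agency has every type equally often."""
    rows = list(design.values())
    n_towns = len(rows[0])
    per_agency = n_towns // len(types)
    rows_ok = all(Counter(r) == Counter({t: per_agency for t in types}) for r in rows)
    cols_ok = all(sorted(r[j] for r in rows) == sorted(types) for j in range(n_towns))
    return rows_ok and cols_ok


print(is_balanced(design))
print(design["A"][towns.index("II")])  # type assigned to agency A in town II
```

With this balance, comparisons among the three types of clustering are free of differences between agencies and between towns.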

In town II, for instance, the two interviewers from agency A worked 
on separate type-2 samples, and so on. This cell provides 1 df for “between 
interviewers” and 2 df for “between clusters within interviewers." The 
analysis of the effects of clustering, which has several points of interest, 
is not given here.

These techniques have two advantages. Since money specifically
earmarked for field research on measurement or sampling techniques is
seldom available, the only way of conducting the research may be to
implant it in an ongoing survey. Moreover, its results are more likely to
apply to practical conditions if they are obtained in this way. Nevertheless,
it is hard to simulate exactly the ordinary conditions of a survey when
part of it consists of a controlled experiment. Even if efforts are made
to conceal this fact, supervisors and interviewers are likely to be aware
that there is something special about part of their work, and a controlled
experiment always disrupts the usual procedures in some way or other.
Much can be done to ensure useful information if this problem is
anticipated and studied in advance.


13.17 SUMMARY


In regard to their effects on the formulas given in preceding chapters, 
nonsampling errors may be classified as follows: 


1. With noncoverage and nonresponse, the most important consequence
is that estimates may become biased, because the part of the population 
that is not reached may differ from the part that is sampled. There is now 
ample evidence that these biases vary considerably from item to item and 
from survey to survey, being sometimes negligible and sometimes large. 
A second consequence is, of course, that the variances of estimates are 
increased because the sample actually obtained is smaller than the target 
sample. This factor can be allowed for, at least approximately, in selecting 
the size of the target sample. 

2. Errors of measurement that are independent from unit to unit
within the sample and average to zero over the whole population are
properly taken into account in the usual formulas for computing the
standard errors of the estimates, provided that fpc terms are negligible.
Such errors decrease the precision of the estimates, and it is worthwhile
to find out whether this decrease is serious.

3. If errors of measurement on different units in the sample are
correlated, the usual formulas for the standard errors are biased. The
standard errors are likely to be too small, since the correlations are
mostly positive in practice. This type of disturbance is easily overlooked
and may often have passed unnoticed.

4. A constant bias that affects all units alike is hardest of all to detect.
No manipulations of the sample data will reveal this bias.

As this chapter has indicated, the study of these problems is slow and
difficult. Nevertheless, a good beginning has been made. Much ingenuity
has been shown in devising techniques for the assessment and control of
nonsampling errors. Although this is a field in which broad generalizations
are hard to attain, information should gradually accumulate about the
nature and magnitudes of errors of measurement in different types of
survey. More needs to be learned also about what can be accomplished
with good training and supervision of the interviewers, with pretesting,
with mechanisms to control the quality of the field work, and with a
postsurvey appraisal of the successes and weaknesses in the operation.

EXERCISES


13.1 Suppose that, by field methods of different intensities, it is possible to
reach “response” strata that consist of 60, 80, 90, or 95% of the whole
population. For a percentage that is to be estimated, the true “response” stratum means
are: 60% stratum, 40.7; 80% stratum, 43.5; 90% stratum, 44.8; 95% stratum,
45.4; last 5%, 59.0. (a) For a method that samples only the 60% stratum, show
that the root mean square error of the estimated percentage for the whole
population is

$$\sqrt{(2414/n) + 28.94}$$

where n is the number of completed questionnaires obtained. (b) Show that a
root mean square error of 5% cannot be achieved by a method with 60% 
response but can be obtained with slightly over 100 completed questionnaires for 
the methods that have a response of 80% or better. (c) If a root mean square 


error of 2% is prescribed, what methods can achieve it and what sample sizes 
are needed? 


13.2 In 13.1 (c) suppose that it costs $5 per completed questionnaire for the 
field method that has a 90% response. To obtain a completed questionnaire 
from the next 5% of the population costs $20. For a root mean square error 
of 2%, is it cheaper to use the method with 90% response rate or that with 95%
response rate?

13.3 A population consists of two strata of equal sizes. The probability of
finding the respondent at home and willing to be interviewed at any call is 0.9
for persons in stratum 1 and 0.4 for persons in stratum 2. (a) In the notation of
section 13.5 show that

$$w_{i1} = 1 - (0.1)^i, \qquad w_{i2} = 1 - (0.6)^i$$

(b) If the original sample size is $n_0$, compute the total expected number of
interviews obtained for 1, 2, 3, 4, and 5 calls. (c) If the relative costs per completed
interview at the ith call are 100, 120, 150, 200, and 300 for i = 1, 2, 3, 4, 5,
respectively, compute the average cost per interview for all interviews obtained
up to the ith call. (d) The money available for the survey is enough to pay for
300 completed first calls. If the policy is to insist on i calls, what are the expected
total numbers of completed interviews that can be obtained for the same amount
of money when i = 1, 2, 3, 4, 5?


13.4 In exercise 13.3 persons in stratum 1 have a mean of 40% for some
binomial percentage that is being estimated and persons in stratum 2 have a
mean of 60%. (a) Compute the bias in the sample mean for i = 1, 2, 3, 4, 5.
(b) Compute the variances of the sample means for the cost situation in part (d) of
exercise 13.3. (To save computing, the variance may be taken as 2600/$n_i$, where
$n_i$ is the expected total number of interviews obtained.) (c) Which policy gives
the lowest MSE?

13.5 In section 13.6 (subsampling of the nonrespondents) show that the
product VC, where V is the prescribed variance $V(\bar{y}')$ and C is the expected
cost, is

$$S'^2 C' + c_2 W_2^2 S_2^2 + k W_2 S_2^2 C' + \frac{1}{k}\, c_2 W_2 S'^2$$

where

$$S'^2 = S^2 - W_2 S_2^2, \qquad C' = c_0 + c_1 W_1$$

and that the minimum value of VC is

$$(S'\sqrt{C'} + \sqrt{c_2}\, W_2 S_2)^2$$

13.6 In a survey on poultry and pigs kept in gardens and certain small 
holdings (Gray, 1957) a postal inquiry with several reminders was followed by 


interviews of a subsample of nonrespondents. By advance judgment, k = 2 
was chosen (i.e., a 50% subsample). The following data were available after the 


survey for one important item, in the notation of exercise 13.5. 
fois, f9295, Aes SES 
co 


By finding VC for k = 2 and for the optimum k, determine whether k = 2 was 
a good choice. 

13.7 In a survey by the Politz-Simmons method 390 respondents in an initial
sample of 660 were found at home on the first call. The numbers who stated that
they were at home on 0, 1, ..., 5 of the five previous nights and the number
answering yes to a question in the survey were as follows.

                 0/5    1/5    2/5    3/5    4/5    5/5
Number            14     35     55     74     94    118
Yes answers        4     13     20     30     42     56

Compute the Politz-Simmons estimate of the proportion of Yes answers in the
population and compare it with the simple binomial estimate.

13.8 A population with N = 6 contains three units for which the correct
answer to a question is yes and three for which it is no. Owing to errors of
measurement, the probability of obtaining a “yes” response on a yes unit is
0.9, and the probability of obtaining a “no” response on a no unit is also 0.9.
(a) By working out the distribution of all possible responses for samples of
size 2, show that the probabilities are 0.218, 0.564, and 0.218 that the sample
gives 0, 1, 2 “yes” responses. (b) Show that the variance of the estimated
proportion of “yes” responses is 0.1090. Verify results (13.22) and (13.24) in section
13.10. (c) What would be the variance of the estimated proportion of “yes”
responses if there were no errors of measurement?

13.9 In part of the 1942 Bengal Labour Enquiry (Mahalanobis, 1946) a 
random sample of about 175 families was taken in each of three strata. The 
sample in each stratum was divided into five random subsamples, each assigned 
to a different interviewer. The five interviewers worked in all three strata. For 
expenditure on food, the relevant part of the analysis of variance (on a single- 


family basis) is as follows: 
ly basis) dr S E(ms) 


Between interviewers dE ora Mox 3501s" + 10507" 
Interviews x strata 8 96 Oye + oe + 35o;s" 
Within subsamples 510 9.9 One + oq 


In the notation of section 13.15 the model for a single family is

$$y_{hij\alpha} = \bar{\mu}_h + g_i + w_{hi} + (\mu'_{hij} - \bar{\mu}_h) + d_{hij\alpha}$$

with variances $\sigma_g^2$, $\sigma_{IS}^2$, $\sigma_\mu^2$, and $\sigma_d^2$ for the last four terms, respectively.
Verify the expressions given for E(ms) and estimate the proportion of the total
variance of the mean that may be ascribed to enumerator biases.
Answers to Exercises


1.5 $80,400 and $82,960.

1.6 The confidence probability is about 0.054 (found from t = −1.67 with 25
degrees of freedom). This assumes that future receipts follow the same frequency
distribution as the sample of 26 receipts.

1.7 When the MSE is due entirely to bias, the estimate is always wrong by $1\sqrt{\text{MSE}}$.
The probability of an error $\ge 1\sqrt{\text{MSE}}$ is therefore unity and the probability of an
error $\ge 1.96\sqrt{\text{MSE}}$ or $\ge 2.576\sqrt{\text{MSE}}$ is zero. In Table 2.1, Pr($\ge 1\sqrt{\text{MSE}}$) tends to 1/2.


2.4 $\hat{Y}$ = 51,473. Probability about 0.9.

2.5 Yes. $\sigma(\hat{Y})$ is 984.

2.6 $\hat{Y}$ = 20,238, $s(\hat{Y})$ = 849.

27 (a). Public: Å = 15446. Private: R= 12.75. (b) Public: s(R) = 0.761. 
Private: s(R) = 0.727. For the fpc we take f = 100/468. (c) 14.2 < R < 16.7. 

2.8 Diff./s.e.(diff) = 2.71/1.186 = 2.28. P about 0.023. Note that the fpc is not
used in computing s.e.(diff).

2.9 (a) 9408, s.e. = 780; (b) 9472, s.e. = 1104. 

2.10 S.e. (in 1000's) = (a) 14,800; (b) 3900; (c) 3140. 

2.34 9.2. (a) 2.7; (b) 2.4. 

2.12 (a) n = 60, with 30 from each domain; (b) n = 80 will do if the number of
owners in the sample lies anywhere between 20 and 60. With n = 80, the probability
that this will happen is 0.54 (from the binomial tables). With n = 100, this probability
is 0.94.

2.14 (a) 420; (b) 490; (c) both are unbiased; (d) estimate (b).


3.2 1066, 1334 as given by the normal approximation, equation 3.17. 


3.3 Nearly conclusive. 
3.6 (a) 76.2 + 3.6%; (b) 1738 + 280 families. 
3.7 1789 + 268 families. 
3.8 As an exact result, 
V(Á) jonas, qe 
WA) Nn — (Qi + Pi) 
Now $N_1 = N(1 - \pi)$, and in large samples $n_1 = n(1 - \pi)$. These substitutions give the
stated result. In order that $V(\hat{A}_1)/V(\hat{A}_1')$ be small, we must have $\pi(1 - Q_1)/Q_1$ large.
This means that $Q_1$ must be small: in other words, the proportion of domain 1 that lies
in class C must be large. For given $Q_1$, $\pi$ should be large.

3.9 Allgive Ay = 13. By the hypergeometric, the probability of no units in C E 
the sample is 0.0601 for A, = 12 and 0.0434 for A, = 13. By the binomial, Py — 
0.4507 and V1 - fP, = 0.4114, giving Ay = 12.3. Page 59 gives 0.061 and 0.044. 

3.11 Estimate (5) seems more precise. 

3.12 The highest value is PQ/n as compared with PQ/mn by the binomial formula. 
This occurs when every cluster consists entirely of 1's or entirely of 0's. The lowest
value can be zero if every cluster gives the same proportion P. (This is possible only for 
certain values of P and m). 

3.13 Variance is 0.00184 by the ratio method and 0.00160 by the binomial formula. 

3.14 Average size of sample = m/P. 


4.1 735 houses. This sample size is needed for two-car households if P = 10%.
4.2 About 260 sheets.

4.3 (a) 2475; (b) 4950.
4.4 n = 21 (taking t = 2).

4.5 n = 484. For number of unemployed, the cv would be about 15%.
4.6 62 more.
4.7 (a) n = 278; (b) n = 2315; (c) n = 3046.


4.8 If a rectangular distribution is assumed within each class, we take $S^2 = 0.083h^2$
or $S = 0.29h$. This gives estimates of 230, 580, 2030, and 11,600 in the four classes.
If the right-triangular distribution is used in the fourth class, we take $S = 0.24h$, giving
9600 for this class.


4.11 n,,, = ( 


UNS V 
2cV2n 


5.1 (a) Neyman allocation gives n, = 0.87, n, = 3.13. (5) There are three possible 

estimates under optimum allocation and nine under proportional allocation. Vp Get) = 
© = 0.167: V9.) = A = 0.583. (d) Formula 521 gives V, (g.) = 0.159. 
5.2 (a) n = 375, na = 625; (b) n, = 750, n, = 250. . 
5.3 RP = 181% for proportional allocation and 214% for optimum qoM 
5.5 The maximum relative increases occur when W, — W, and equal 0.030 fo 
r — 2 and 0.111 for r — 4. 

S6 (a) mn = 4, njn = 8; (b) n = 264, n = 88, n, = 176; (c) $1936. 


5.7 (a) $2288 against $1936, (6) No. The minimum field cost to reduce V to DE 
$2230. 


5.8 (a) $n_1$ = 384, $n_2$ = 192; (b) $n_1$ = 400, $n_2$ = 1600; (c) $n_1$ = 1200, $n_2$ = 2400.
5.9 Fractional increase = 1.
5.10 $n_1$ = 541, $n_2$ = 313, $n_3$ = 146.


5.12 In population 1, V_prop = 0.143/n, V_opt = 0.134/n. In population 2, V_prop =
0.0491/n, V_opt = 0.0423/n. The reduction in variance from optimum allocation is about
6% in population 1 as against 14% in population 2.
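
The percentage reductions in 5.12 follow directly from the two pairs of variances, since the common factor 1/n cancels. A short check, with an illustrative helper name:

```python
def pct_reduction(v_prop, v_opt):
    """Percentage reduction in variance from proportional to optimum allocation."""
    return 100 * (v_prop - v_opt) / v_prop

# Coefficients of 1/n quoted in the answer for the two populations
pop1 = pct_reduction(0.143, 0.134)
pop2 = pct_reduction(0.0491, 0.0423)
print(round(pop1), round(pop2))
```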




5.14 (a) If we guess P_1 = 45%, P_2 = 25%, P_3 = 7.5% as a compromise, this gives
n_1 = 268, n_2 = 116, n_3 = 16; (b) s.e. = 0.0225; (c) s.e. = 0.0241.


5A.4 No. In each of the worst cases [Σ(w_h − W_h)Ȳ_h]^2 is (0.105)^2 = 0.011. Thus,
with stratification, MSE(ȳ_st), as given by formula (5A.6), is 0.0108 + 0.0110 = 0.0218.
With simple random sampling, V(ȳ) = 0.0177.
5A.5 (a) n = 1024. The optimum allocation for the second variate (average amount
invested) satisfies both requirements. (b) The allocation given by equation (5A.14), 
p. 124, with 4 = 0.09, satisfies both requirements for n = 1031. 

5A.6 W_1 = 0.728, W_2 = 0.272. S_1 = 1.806, S_2 = 4.698 (in the coded scale). (a) The
optimum sample sizes are n_1 = 0.507n, n_2 = 0.493n. (b) V(ȳ) = 31.95/n, V_opt(ȳ_st) =
6.72/n.
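
As a hedged illustration of 5A.6, Neyman allocation takes n_h/n proportional to W_h S_h, and the corresponding variance (ignoring the fpc) is (Σ W_h S_h)^2/n. The sketch below plugs in the quoted W_h and S_h; the variable names are illustrative.

```python
W = [0.728, 0.272]   # stratum weights quoted in the answer
S = [1.806, 4.698]   # stratum standard deviations (coded scale)

ws = [w * s for w, s in zip(W, S)]
total = sum(ws)

shares = [x / total for x in ws]   # n_h / n under Neyman allocation
v_opt_times_n = total ** 2         # n * V_opt(ybar_st), fpc ignored

print([round(s, 3) for s in shares], round(v_opt_times_n, 2))
```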


5A.7 (b) ∫_0^a √f(y) dy = ∫_0^a √(2(1 − y)) dy = 2√2[1 − (1 − a)^(3/2)]/3. Hence we
want [1 − (1 − a)^(3/2)] = 1/2.
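
Solving the condition in 5A.7 for a gives (1 − a)^(3/2) = 1/2, so a = 1 − 2^(−2/3). A small sketch, assuming the density f(y) = 2(1 − y) on (0, 1) as read from the integrand (the function name is illustrative):

```python
from math import sqrt

# Boundary a for two strata under the cum sqrt(f) rule, with f(y) = 2(1 - y)
a = 1 - 0.5 ** (2 / 3)

def cum_sqrt_f(upper):
    """Closed form of the integral of sqrt(2(1 - y)) from 0 to upper."""
    return 2 * sqrt(2) * (1 - (1 - upper) ** 1.5) / 3

# The boundary should split the total cum sqrt(f) into two equal parts.
half_of_total = cum_sqrt_f(1.0) / 2
print(round(a, 3), abs(cum_sqrt_f(a) - half_of_total) < 1e-12)
```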

5A.8 The optima are L = 7 for p = 0.95, L = 5 for p = 0.9, and L = 4 for p = 0.8.
Either L = 5 or L = 6 is a good compromise. 

5A.9 (a) Gain in precision is about 110%, (b) Gain from proportional stratification 
over simple random sampling is about 90%. 

5A.10 (a) 3.733, (b) 1.111, (c) 8.222. 


6.1 For the ratio estimate V(Ŷ_R) = N^2(1 − f)S_d^2/n and for simple expansion
V(Ŷ) = N^2(1 − f)S_y^2/n, where d = (y − Rx). For the sample of 21 households the
estimates of S_d^2 and S_y^2 are as follows. Number of children, s_d^2 = 0.49, s_y^2 = 1.61;
number of cars, s_d^2 = 0.41, s_y^2 = 0.39; number of TV sets, s_d^2 = 0.51, s_y^2 = 0.45. The
ratio estimate appears superior for children.
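
Since both variances in 6.1 share the factor N^2(1 − f)/n, the comparison reduces to the ratio s_y^2/s_d^2. The sketch below, using the quoted sample estimates and illustrative names, computes this relative precision for each item; values above 1 favor the ratio estimate.

```python
# (s_d^2, s_y^2) per item, as quoted for the sample of 21 households
estimates = {
    "children": (0.49, 1.61),
    "cars": (0.41, 0.39),
    "TV sets": (0.51, 0.45),
}

def relative_precision(s_d2, s_y2):
    """Variance of simple expansion divided by variance of the ratio estimate."""
    return s_y2 / s_d2

rp = {item: relative_precision(*v) for item, v in estimates.items()}
for item, value in rp.items():
    print(item, round(value, 2))
```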

6.2 Gain = 66%. At least 11 units by the ratio method. 

6.3 Quadratic limits (27,100, 29,870); normal limits (27,030, 29,700). 

6.5 Apply theorem 6.3 to the estimation of R = Y/X. With large samples, use
ȳ/X̄ if ρ < (cv of x)/2(cv of y), and use ȳ/x̄ otherwise.

6.6 The MSE's are 46.5 for the separate ratio estimate and 40.6 for the combined 
ratio estimate. In both cases the contribution of bias to the MSE is negligible. 

6.7 For Lahiri's method, V(Ŷ) = 40.1.

6.8 Estimated population total = 116.21 millions. The relative variance is 0.00111,
so that the s.e. is (0.0333)(116.21) = 3.87 millions. The estimate is within 1 s.e. of the
true total.
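
The arithmetic in 6.8 converts a relative variance into a standard error: the coefficient of variation is the square root of the relative variance, and the s.e. is that coefficient times the estimate. A minimal sketch with illustrative variable names:

```python
from math import sqrt

total_millions = 116.21       # estimated population total
relative_variance = 0.00111   # relative variance of the estimate

cv = sqrt(relative_variance)        # coefficient of variation
se_millions = cv * total_millions   # standard error, in millions
print(round(cv, 4), round(se_millions, 2))
```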

6.9 The estimates are (a) 1896, (b) 1660, (c) 1689. In (c) we find y; = 298 Ware 
−1.38. Estimated s.e.’s are (a) 256, (b) 36.9, (c) 18.6. For the s.e. in (b) I used the
formula se, — Y, V — fen Fen — Zelt where Fp is the ratio estimate of Y, that
is, 1660. For the s.e. in (c) I used Y; = 1689. 


7.1 Estimate = 11,080; s.e. = 152 (including the fpc).
7.2 No, since b is very close to 1. 

7.3 Ŷ_lr = 28,77 ± 570. The relative precision is 113%.

7.4 27,751 ± 694.


7.6 For the difference estimate, V = S_e^2/n; for the linear regression estimate,




V = S_e^2 S_y^2/[n(S_e^2 + S_y^2)]. The regression estimate has the smaller variance, but its
superiority is unimportant if S_e^2/S_y^2 is small.


73 V(Yirr) = 34.5, (Yi) = 103. 


8.1 Variances are 8.19 (systematic), 11.27 (simple random), 8.25 (stratified, 2), 
7.46 (stratified, 1). 


8.2 V_sy = 0.00141, V_ran = 0.00340.

8.3 The systematic sample should be superior for the proportion of people of
Polish descent, since this variable exhibits geographical stratification. It is likely to be
inferior for proportion of children because the sampling interval, 1 in 5, coincides with
the average size of a household. The same is true, though to a smaller extent, for
proportion of males.


8.4 The variances are as follows. Males, V_ran = 0.0204, V_sy = 0.0216; children,
V_ran = 0.0204, V_sy = 0.0776; professional, V_ran = 0.0192, V_sy = 0.0016.


8.5 Actual variance = 8.19. Method (a) gives 11.29. For method (b) the estimated 
variance from a single sample is (1 − f)(ȳ_1 − ȳ_2)^2/4, where ȳ_1, ȳ_2 are the means of the
two halves. The average is 3.24. The serious underestimation is unexpected. 

8.7 Both variances are (k? — 1)/6. 


8.8 Simple random sampling is better unless n = 1 or k = 1.

9.1 Relative costs of using the four types of unit are as 100, 90.1, 79.7, and 77.8
(taking the first unit as a standard).


9.2 Relative precision of the household is 211% for the sex ratio and 38% for the
proportion who had seen a doctor. 


9.3 Relative precision of the large unit is 0.566 with simple random sampling and 
0.625 with stratified random sampling. 


9.5 (a) M = 5; (b) M = 1.


9.6 The optimum M should decrease because travel cost, which varies as √n,
becomes relatively less important when n increases. 


9.7 (a) 34,242; (b) 5534; (c) 6493.

9.9 (a) If the s.d. among large units in class h ∝ M_h. (b) If probability ∝ √M_h.
940 VCP.) = 1.75, V$,,) = 0.50, V(f,) = 0.33. 
10.3 (a) 2.00; (b) 2.13. 
10.3 (a) 165/n; (b) 148.5/n; (c) 132/n. 
10.4 (a) n = 660 fields; (b) n = 530 fields. Protein requires fewer fields than yield.

10.5 cc, = 8.

10.7 (a) 0.93%; (b) 0.51%; (c) 0.36%.

10.8 (a) Either m_0 = 7 or m_0 = 8 is suitable; (b) 89% for m_0 = 7 and 93% for
m_0 = 8; (c) 86% for m_0 = 7 and 89% for m_0 = 8.


The relative precision of III to II drops from 3.02 to 2.75. If two sampling plans
differ primarily in their between-units contribution to the variance, the relative
precision of the superior plan will in general decrease as the ratio of the within-units
variance to the total variance increases.


11.3 The explanation is, roughly speaking, that with these data the Y_i/z_i are more
stable than the Y_i/M_i. If we took z_i = 55, 55, and 5$, the between-units contribution in
method IV would vanish. 

11.4 Total variance: 0.00482 (Ia), 0.02337 (II), 0.00554 (III).

11.6 Estimated percentage 14.2 ± 2.16. Estimated number 3540 ± 540.

11.7 Estimated percentage 13.9 ± 2.49.

11.9 (a) Total rooms, 29,400, total persons, 50,550, persons per room, 1.72; (b) s.e.'s: 
total persons, 2,440, persons per room 0.066. 


12.1 n = 267, n' = 1320 or n = 268, n' = 1280. V(p_st) with optimum allocation is
6.67 when p_st is in %'s. With single sampling, V(p) = 8.33.

12.2 cc, > 9.

12.4 n' > 16n.

12.5 By formula (12.29), s.e. = 1.25.

12.6 Per cent gains from the second to the sixth occasion are 50, 75, 91, 100, and
105, respectively.

12.8 The values of n V(gq")/S? and n V(ja')/S? are as follows: # = i, p = 0.8: 0.885,
0.875; u = 1, p = 0.9: 0.843, 0.840; = b, p = 0.8: 0.824, 0.810; 4 7 h p = 0.9:
0.752, 0.746.


13.1 (c) 90% response with 1047 completed questionnaires or 95% response with 701
completed questionnaires.

13.2 The method with 90% response costs $5235. That with 95% response costs
$5.7895 per completed questionnaire, or $4058 total cost.
13.3 (b) 0.657n_0, 0.815n_0, 0.8915n_0, 0.9351n_0, 0.9611n_0; (c) 100, 104, 108, 112, 117;
(d) 300, 288, 277, 267, 256.

13.4 (a) Bias (in %) = −3.85, −2.15, −1.21, −0.69, −0.40; (b) variances are
8.67, 9.03, 9.39, 9.74, 10.16; (c) four calls.

13.6 Yes. VC for k = 2 is only about 2% over the EQUUM

13.7 Politz-Simmons estimate, 39.7%; binomial, epit

13.8 (c) Variance = 0.1.

13.9 If each enumerator's error of measurement were independent from family to
family, the variance of the sample mean would be (c + 04? + Os 2)/525 instead
of (ay? + o4? + 350, + 1050; 2)/525. Enumerator biases contribute about 55% of
the total variance.

compared with separate ratio estimate, 170
estimated variance, 171 
in estimating domain means, 148 
in stratified two-stage sampling, 320 
optimum allocation for, 174 
upper bound to relative bias, 169 
variance, 170 
Combined regression estimate, 201 
bias, 203 
compared with separate regression 
estimate, 203 
estimated variance, 203 
variance, 203 
Component of variance, between pri- 
mary unit means, 280 
Composite estimate, in repeated sam- 
pling, 351 
Compromise allocation of sample sizes 
to strata, 119 
Conditional distribution, of proportions, 
60 
worked example for proportions, 61 
Confidence limits, 12 
conditional, 60 
definition for attributes, 56 
effect of bias on, 14 
effect of nonresponse on, 357-359 
for hypergeometric distribution, 56 
for proportions or percentages, 56, 
60 
for ratio estimates, 164 
in simple random sampling, 26 
in stratified random sampling, 94 
validity of normal approximation, 38 


Consistency, definition, 20 
of mean of simple random sample, 
20 
of ratio estimate, 157
of regression estimate, 190 
Controlled experiments, in studying er- 
rors of measurement, 387 
Controlled selection, 128 
Correction for continuity, 57, 63 
"Correct value," 374 
Correlation coefficient, in finite pop- 
ulations, 158 
intracluster, 210, 242, 316 
within a systematic sample, 210 
Correlogram, 219, 221 
Cost function, for number of strata, 
134 
in analytical surveys, 145 
in determining optimum probabili- 
ties of selecting primary units, 
313-318 
in determining optimum sampling 
fraction for nonrespondents, 367 
in determining optimum size of 
unit, 245 
in determining optimum subsam- 
pling fraction, 279 
in determining sample size, 82
in double sampling for regression 
estimates, 334 
in double sampling with stratifica- 
tion, 328 
in stratified random sampling, 95 
Covariance of sample means, 24 
Cumulant, k4, 43 
Cumulative √f rule, 130
Current estimates, in repeated sam- 
pling, 342-351 


Degrees of freedom, effective num- 
ber in stratified sampling, 94 
Descriptive surveys, 4 
Domains of study, 33 
see also Analytical surveys 
Double sampling, applications to sur- 
veys, description, 118, 327 
of nonrespondents, 367-371 
Double sampling with ratio estimates, 
339 
estimated variance, 341 
optimum sample sizes, 340-341
variance, 340
Double sampling with regression estimates, 334
comparison with simple random sampling, 336
estimated variance, 337
optimum sample sizes, 337
variance, 334-336
Double sampling with stratification, 328
comparison with simple random sampling, 331-332
estimated variance, 332
estimation of proportions, 330
optimum sample sizes, 330
variance, 329-330


E, average over all possible samples, 
11 


End corrections, 217-218 
Error, limits of, 72, 74, 75 
Errors in surveys, types of, 355 
Errors of measurement, 374 
effects of constant bias, 376 
effects of correlation between er- 
rors, 380 
effects of errors uncorrelated with- 
in the sample, 377 
mathematical model for, 374-376 
summary of effects, 381, 389 
techniques for studying, 381-389 
use of interpenetrating subsamples, 
383-387 
Estimates of population variances, 
effects of errors in S_h on precision
of stratified sampling, 116 
for determining sample size, 77 
Expansion factor, 20 
Eye estimates, 190 


Field costs, effect on optimum size
of unit, 238, 247 
Finite population correction (fpc), 23 
for ratio estimates, 31, 158 
for regression estimates, 192 
in stratified random sampling, 91 
in two-stage sampling, 276 
rule for ignoring, 24 
Frame, 7
estimates when frame contains units 
not in population, 35-37 
use of directory plus area sample, 
360 


Geographic stratification, 101 
construction of strata, 132 
gains in precision from, 102 

Grid sample in two dimensions, 228 


Hard core, in nonresponse, 360 
Hypergeometric distribution, 55 
as conditional distribution, 61 
confidence limits for, 56 
charts and tables of (reference), 
31 
worked example for, 55 


Incomplete stratification, 278 

Inflation factor, 20 

Interpenetrating subsamples, 383 
estimated variance, 384 
in stratified sampling, 385-387 
variance of estimate, 384 

Interviewer bias, 375 
mathematical models for, 383-387 

Intracluster correlation, 210 

Item, definition, 19 


Knight's move latin square, 229 
Kurtosis, 44 
effect on sample variance, 43 


Latin squares, use in systematic sam- 
pling, 229 

Lattice sampling, 229 

Limits of error (tolerable), 72, 74, 
75 

Linear programming, in pps selec- 
tion without replacement, 263 

Linear regression estimate, see Re- 
gression estimate 

Listing of primary units, effect of 
listing cost on optimum prob- 
ability of selection, 314—318 

Loss due to errors in estimates, 82 

Loss function, 82, 120 


Mail surveys, 356, 367 
Margin of error, 74 
Matching (in repeated sampling of the same population), 344, 346, 348
Maximum likelihood estimates, 154 
Mean square error, definition, 15 

justification for use of, 15 

relation to variance and bias, 15 
Measure of homogeneity, 244 
Measures of size, 252 

optimum for single-stage cluster 

sampling, 255 

Multivariate ratio estimate, 184 


Neyman allocation, 97, 128 
best stratum boundaries for, 129 
Noncoverage, 360 
Non-normality, effect of stratification 
on, 43 
effect on confidence limits, 38 
effect on sample means, 40 
effect on sample variances, 43 
frequently encountered in sampling 
practice, 38 
Nonresponse, 355 
bias produced by, 356-359 
effect of call-backs, 361-366 
effect on confidence limits, 357-359 
effect on variance in stratified sampling, 148-149
optimum sampling fraction among non-respondents, 367-371
Politz and Simmons’ method, 371-374
reasons for, 359-361
sample size needed, 358-359
Normal distribution, 9 
as approximation to binomial, 57 
as approximation to hypergeometric, 
57 
as limiting distribution of sample means, 38
use in surveys, 11 
validity with means from continuous 
data, 38-47 
Notation, for effects of call-back policies, 363
for errors of measurement, 374-376
for proportions, 49, 62
for ratio estimates, 155
for simple random sampling, 19
for stratified sampling, 88
for two-stage sampling, 275
for variances of estimates, 25


Optimum allocation in stratified sampling, comparison with proportional allocation, 98, 108
comparison with simple random sampling, 98, 100
determination from previous data, 101, 131
effect of deviations from the optimum, 114
effect of errors in S_h, 116
effect of errors in stratum sizes, 116
for fixed sample size, 97, 107
for fixed total cost, 95, 107
in analytical surveys, 145
in sampling for proportions, 107
requiring more than 100% sampling, 102
with double sampling, 330
with more than one item, 118, 120
with ratio estimates, 173
Optimum allocation in stratified two-stage sampling, 288
Optimum per cent matched, in sampling on more than two occasions, 346
in sampling on two occasions, 34
Optimum size of subsample, primary units of equal size, 279
primary units of unequal size, 313-318
Over-all sampling fraction, 307


Percentages, estimation of, see Proportions, estimation of
Periodic variation, effect on systematic sampling, 218
Pilot survey, use in estimating optimum sampling and subsampling fractions, 283-285
use in estimating population variances, 78
Politz and Simmons’ method for handling nonresponse, 371-374
Population, 5
autocorrelated, 219
in random order, 214
natural (data from), 221-224
sampled, 6 
target, 6 
two-dimensional, 228 
three-dimensional, 229 
with linear trend, 216 
with periodic variation, 218 
Precision, contrasted with accuracy, 15 
relative, 98 
specification of, 73 
Pretest, 8 
Primary sampling units (primary units), 
270 
Probability proportional to estimated 
size (ppes), in single-stage sam- 
pling, 252-255 
in two-stage sampling, 271-275, 
297-299, 305-307, 308-311, 312- 
313, 314-318 
optimum measure of size, 255 
selection without replacement, 260 
two-stage sampling without replace- 
ment, 321-322 
Probability proportional to size in 
single-stage sampling, 251 
compared with equal probabilities, 
256 
compared with ratio estimate, 256 
in stratified sampling, 259
method of drawing sample, 251
selection without replacement, 260
Probability proportional to size in two-stage sampling, 295
compared with equal probabilities, 296, 309
compared with ratio estimate, 309
selection without replacement, 321-322
variance of unbiased estimate, 308
variance of ratio estimate, 313
Probability sampling, definition and properties, 10
Proportional allocation in stratified sampling, 89, 108
comparison with optimum allocation, 98, 102, 108
comparison with simple random sampling, 98
comparison with stratification after selection, 135
in sampling for proportions, 108
rule for use of, 102, 109
self-weighting sample obtained, 89
variance, 91, 107

Proportions, estimation of, 49
effect of nonresponse, 357-359
effect of population P on precision, 52
in cluster sampling, 64-67, 247-248
in double sampling, 330
in simple random sampling, 49-67
in stratified random sampling, 106-109
in two-stage sampling (units of equal size), 278-279
in two-stage sampling (units of unequal size), 311-313
size of sample for, 71, 74
with more than two classes, 60, 61
Purposive selection, 11


Quadratic confidence limits for ratio estimate, 164
Qualitative characteristics, 49
see also Proportions, estimation of
Quota sampling, 136


Raising factor, 20
Random member of household, method of selecting, 360
Random numbers, 18
"Random point" method of selection, 241
Random sampling, see Simple random sampling
Rare items, large sample needed for, 54
Ratio estimate, 29, 155
accuracy of approximate variance, 157, 159
adjustments to decrease bias, 180
as special case of regression estimate, 190
bias, 160-162
compared with mean per unit, 165
compared with pps selection, 256
compared with regression estimate, 199
compared with stratified sampling, 172, 256
conditions under which unbiased, 
161 
confidence limits, 164 
consistency, 157 
estimated variance, 31, 163 
Hartley-Ross estimate, 176 
in analytical surveys, 18 
in cluster sampling, 65-67, 300, 308, 
311-313 
in double sampling, 339 
in stratified random sampling, 167 
in stratified two-stage sampling, 320 
Lahiri's method, 177 
multivariate, 184 
optimum allocation for, 173 
optimum conditions for, 158 
standard error for comparison of two ratios, 181
unbiased ratio-type estimates, 176 
upper bound to relative bias, 162 
variance, 30, 157 
see also Combined ratio estimate and 
Separate ratio estimate 
Record checks, 382 
Rectangular distribution, variance, 79 
Regression coefficient, in finite pop- 
ulations, 191, 192 
Regression estimate, 189-204 
accuracy of large-sample variance, 
196 
bias, 197-198 
compared with mean per unit, 199 
compared with ratio estimate, 199 
conditions under which unbiased, 
198 
effect of error in slope, 192 
estimated variance, 196 
in double sampling, 334 
in repeated sampling of the same population, 342-352
in stratified random sampling, 200, 202
uses, 189
variance, 194
with inefficient estimate of b, 198
with preassigned slope, 190
see also Combined regression estimate and Separate regression estimate


Reinterviews, in study of errors of 
measurement, 382 
Relative net precision, 235 
Relative precision (RP), method of computing, 102
of optimum and general allocation, 114
of stratified random and simple random sampling, 98, 108, 137
type of population giving large gains 
in precision, 100 
Repeated sampling of the same popula- 
tion, 341-352 
composite estimate, 351 
current estimates, 342-351
estimates of change, 341-342, 348-352
optimum per cent matched, 344
replacement policy, 342
samples drawn with no matching, 342, 351
sampling on more than two occasions, 
345 
sampling on two occasions, 342 
types of estimate wanted, 341 
use of nonrespondents from recent 
surveys, 374 
Replicated sampling, 385 
see also Interpenetrating subsamples 
Right-triangular distribution, variance, 
79 
Rotation sampling, see Repeated sam- 
pling of the same population 


Sampled population, 6 
Sampling fraction, 20 
first-stage, 276 
over-all, 307 
second-stage, 276 
Sampling on more than two occasions, 
345 
see also Repeated sampling of the 
same population 
Sampling on two occasions, 342 
see also Repeated sampling of the 
same population 
Sampling units (unit), definition, 7
method of determining optimum unit, 
234-242, 244-247 
optimum measure of size, 255 
variance, 29 
with unequal probabilities, 252-260, 305-311, 312-318

Sampling without replacement, 19 

with unequal probabilities, 260-266, 
271-275, 321-322 

Sampling with replacement, 19 

Selection with arbitrary probabilities, 
in single-stage sampling, 252 

in two-stage sampling, 271-275, 297, 
305, 308, 320 
Self-weighting estimate, 89 
in two-stage sampling, 301, 302, 304, 
306, 308, 310 
Separate ratio estimate, 167 
compared with combined ratio es- 
timate, 170 
estimated variance, 171 
in stratified two-stage sampling, 320 
liability to bias, 168 
optimum allocation, 173 
variance, 168 
Separate regression estimate, 200 
compared with combined regression 
estimate, 203 
estimated variance, 202 
liability to bias, 202 
variance, 201, 202 

Short-cuts, in computation of standard errors, 142
in computation of variance of ratio estimate, 173
Simple expansion, 165
comparison with ratio estimate, 165
Simple random sampling, 18
confidence limits for sample mean, 26
confidence limits for sample proportion, 56
distribution of sample proportion, 54, 55
estimated variance of sample mean, 25
estimated variance of sample proportion, 51
for classification into more than two classes, 60
method of drawing, 18
precision compared with stratified random sampling, 98, 108, 137
sample size needed for means, 75
sample size needed for proportions, 74
unbiased sample proportion, 50
variance of sample mean, 22, 27
variance of sample proportion, 50
Size of sample for specified limits of 
error, analysis of problem, 72 
by minimizing cost plus loss due to 
errors, 82 
Cox's method of two-stage sampling, 
77 
effect of nonresponse on, 358-359 
for comparisons between domains, 82 
for means over domains, 81 
in stratified random sampling, 103, 
109 
with continuous data, 75 
with more than one item, 79 
with proportions, 71, 74 
worked examples, 71, 75, 76 
Size of sample needed, for estimating 
optimum subsampling fraction, 283 
for normal approximation to confi- 
dence limits for continuous data, 
41 
for normal approximation to con- 
fidence limits of proportions, 57 
Skewed population, experimental sam- 
ples from, 39 
Skewness, coefficient of, 41 
effect of stratification on, 43 
effect on confidence limits, 40 
Split schedule, 387 
Square grid sample, 228 
Standard deviation, in finite popula- 
tion, 21-22 
Standard error, of difference between 
domain means, 37-38 
of difference between two ratios, 
181-184 
of domain mean, 33-34 
of domain mean in stratified sam- 
pling, 145-147 
of domain total, 34-37 
of domain total in stratified sampling, 147-149
of estimated population total from simple random sample, 23
of mean in cluster sampling, 242-244
of mean in selection with unequal probabilities without replacement, 262, 321
of mean in stratified sampling, 90-95
of mean in three-stage sampling, 286-287
of mean of simple random sample, 22-23, 24-25
of mean of systematic sample, 208-211
of mean per element in selection with unequal probabilities, 253-255, 260-264
of mean per element in two-stage sampling
of proportion in cluster sampling
of proportion from simple random sample
of proportion in stratified sampling, 106-107
of proportion in two-stage sampling, 278-279
of proportion over a domain, 62-63
of ratio estimate, 30-31, 157-160
of ratio estimate in stratified sampling, 168-170
of ratio in two-stage sampling, 311-313, 320-321
of regression estimate, 191, 194, 196
of regression in stratified sampling, 200-203
of sample standard deviation, 43
of total in population possessing some attribute, 51
Steps in a sample survey, 5
Strata, 87
construction of, 128-133
optimum number, 133-135
Stratification, after selection of sample, 135
best variable for, 100, 128
effect of number of strata on precision, 133
effect on normality of variate, 43
reasons for, 87
with double sampling, 328
with two criteria, 126
with two-stage sampling, 318-320
Stratified random sampling, 87
compared with simple random sampling, 93-95
comparison with pps selection, 2
comparison with ratio estimate, 172, 256
comparison with systematic sampling, 211-224
confidence limits for continuous data
construction of strata, 1
estimated variance of ȳ_st, 93
of p_st, 107
estimation of totals and means over domains of study, 146
for proportions, 1
in analytical studies, 145
in two-stage sampling, 288
optimum allocation, 95-97
size of sample, 103, 109
type of population for which gains in precision are large, 100
variance of p_st, 107
variance of ȳ_st, 91
with one unit per stratum, 141
with ratio estimates, 167
with regression estimates, 200
see also Strata, Stratification
Stratum weight, W_h, 88
Subpopulation, 2
Subsampling of nonrespondents, 367-
Subsampling (units of equal size)
Superpopulation, 214
Systematic sampling, advantages
comparisons with simple random sampling, 209, 210, 213-224
comparison with stratified sampling, 211-224
effect of periodic variation, 218
end corrections, 217-218
estimation of the variance, 224-227
in autocorrelated populations, 219
in natural populations, 221
in populations in "random" order, 214
in populations with linear trend, 216
in single-stage cluster sampling, 263
in two dimensions, 228
in two-stage sampling, 278
method of drawing, 207, 208
recommendations about use, 230
relation to cluster sampling, 207
stratified systematic sampling, 227
variance of estimate, 208


Target population, 6 
Theory, function in sample surveys, 9 
Three-stage sampling, 285 
optimum sampling and subsampling 
fractions, 287 
variance of mean per third-stage 
unit, 286 
Travel costs, 95 
Two-dimensional population, unaligned systematic sample, 228
use of latin square principle, 229 
Two-phase sampling, see Double sam- 
pling 
Two-stage sampling (units of equal 
size), advantage of, 270 
definition, 270
notation for, 275 
optimum sampling and subsampling 
fractions, 279 
stratified sampling of the primary 
units, 288 
table for selecting optimum size of 
subsample, 282 
use of pilot survey, 283 
variance of estimated mean, 275 
estimated variance, 276 


variance of estimated proportion, 278
Two-stage sampling (units of unequal size), comparison of equal probabilities with pps selection, 296, 309
estimation of proportions, 311
optimum sampling and subsampling fractions, 313-318
planning of the sample, 322
ratio to another variable, 311-313
ratio-to-size estimate, 300
selection of primary units without replacement, 321
units chosen with equal probabilities, 293-295, 300, 304
units chosen with probability proportional to measures of size, 295-299, 305-309
with stratification, 318, 320
Two-way stratification, 126


Unaligned systematic sample, 228 
Unbiased procedure or estimate, defi- 
nition, 11, 20 
Unit (sampling unit), definition, 7 
see also Sampling unit 
United States Census, use of sampling in, 3
Unrestricted random sampling, 18 
see also Simple random sampling 


Variance, definition of S^2 and σ^2, 15, 16

Variance function, 244 

Variance of population, advance es- 
timates of, 77 

Variance of sample estimates, see Standard error